New Contributor
Posts: 1
Registered: 05-11-2016

Stream data from Kafka to HDFS



Looking for some advice on the best way to store streaming data from Kafka into HDFS. Currently, using Spark Streaming at 30-minute intervals creates lots of small files. I have attempted to use Hive and make use of its compaction jobs, but it looks like this isn't supported when writing from Spark yet.


Any advice would be greatly appreciated.


Cloudera Employee
Posts: 275
Registered: 01-09-2014

Re: Stream data from Kafka to HDFS

You may want to investigate using Flume to stream messages from Kafka to HDFS:


You can use a Kafka source if you need to modify the messages before they are delivered to HDFS, or you can use the Kafka channel if no modification is needed.
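The Kafka-channel approach might look something like the sketch below. This is a hypothetical agent configuration, not one from the thread: the agent name (a1), topic, broker address, and HDFS paths are all placeholders to adapt to your environment.

```properties
# Hypothetical Flume agent: Kafka channel feeding an HDFS sink directly
# (no source is needed when the Kafka channel delivers straight to the sink).
a1.channels = kc
a1.sinks = hdfs1

a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.kafka.bootstrap.servers = broker1:9092
a1.channels.kc.kafka.topic = my-topic
a1.channels.kc.kafka.consumer.group.id = flume-hdfs
# Messages come from plain Kafka producers, not other Flume agents
a1.channels.kc.parseAsFlumeEvent = false

a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = kc
a1.sinks.hdfs1.hdfs.path = /data/events/%Y-%m-%d
a1.sinks.hdfs1.hdfs.fileType = DataStream
# Roll files by time/size rather than event count to reduce small files
a1.sinks.hdfs1.hdfs.rollInterval = 3600
a1.sinks.hdfs1.hdfs.rollSize = 134217728
a1.sinks.hdfs1.hdfs.rollCount = 0
```

Tuning the roll settings (one file per hour or per ~128 MB here) is what keeps the sink from producing the many small files the original question describes.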



New Contributor
Posts: 1
Registered: 12-28-2016

Re: Stream data from Kafka to HDFS



I am looking for something exactly like this. Small file creation is not a problem for me, as I have a cron job that merges the smaller files into a larger one daily before deleting the smaller ones.
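The daily merge job could be sketched like this. Note this is a local-filesystem stand-in, not the poster's actual script: a real job against HDFS would use `hdfs dfs -getmerge` / `hdfs dfs -cat` or an HDFS client library instead of `pathlib`, but the merge-then-delete logic is the same.

```python
from pathlib import Path

def merge_small_files(src_dir: str, dest_file: str, delete_after: bool = True) -> int:
    """Concatenate every file in src_dir into dest_file, then optionally
    delete the originals. Returns the number of files merged.

    dest_file should live outside src_dir so it is not swept up itself.
    """
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dest_file, "wb") as out:
        for p in parts:
            out.write(p.read_bytes())
    if delete_after:
        for p in parts:
            p.unlink()
    return len(parts)
```

Sorting the part files first keeps the merged output in a deterministic order, which matters if the small files are time-ordered batches.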


I am looking for Spark Streaming to pull records from my Kafka topic to HDFS. I guess you already have implemented it. Can you provide me some insights or references as to how you achieved this?

Cloudera Employee
Posts: 104
Registered: 07-10-2017

Re: Stream data from Kafka to HDFS

Flume is likely your best option, as Patrick has pointed out. Refer to his link for further details on this.


Should you wish to use Spark Streaming, that's also possible. To integrate Kafka and Spark you can follow the guide below:


Then you can write a DStream output to save into HDFS:
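A minimal PySpark sketch of that pattern, using the direct Kafka DStream API, might look like the following. The topic name, broker address, and output path are placeholders, and running this requires a live Kafka broker and the spark-streaming-kafka package, so treat it as an outline rather than a drop-in job.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaToHDFS")
ssc = StreamingContext(sc, 60)  # 60-second batch interval

# Direct stream: each record is a (key, value) pair from the topic
stream = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "broker1:9092"})

# Keep only the message value and write each batch under the given prefix.
# Note: every batch still produces its own set of files, which is exactly
# the small-files issue raised earlier, so a periodic compaction step is
# usually still needed.
stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/events/batch")

ssc.start()
ssc.awaitTermination()
```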

New Contributor
Posts: 5
Registered: 09-26-2018

Re: Stream data from Kafka to HDFS

What about:


Kafka => HBase => HDFS

or if you have Kudu, Kafka => Kudu => HDFS

So for real-time (near-real-time) data, store and access it in HBase/Kudu.


Later, you can move data from HBase/Kudu to HDFS on an N-hourly/daily basis to avoid the small file problem.


FYI, Hive can sync with HBase.
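The Hive-on-HBase mapping could be set up roughly as follows. This is an illustrative DDL fragment, not from the thread: the table name "events", column family "d", and column names are all hypothetical.

```sql
-- Map a Hive external table onto an existing HBase table "events"
-- (column family "d"); Hive queries then read live HBase data.
CREATE EXTERNAL TABLE events_hbase (
  rowkey  STRING,
  payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:payload")
TBLPROPERTIES ("hbase.table.name" = "events");

-- The periodic move to HDFS-backed storage is then a plain Hive insert,
-- e.g. daily:
-- INSERT INTO TABLE events_archive
-- SELECT * FROM events_hbase WHERE ...;
```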
