New Contributor
Posts: 1
Registered: ‎05-11-2016

Stream data from Kafka to HDFS

Hi, 

 

I'm looking for advice on the best way to store streaming data from Kafka in HDFS. I'm currently using Spark Streaming at 30-minute intervals, which creates lots of small files. I tried writing to Hive to take advantage of its compaction jobs, but it looks like compaction isn't supported yet when writing from Spark.

 

Any advice would be greatly appreciated.

 

Cloudera Employee
Posts: 198
Registered: ‎01-09-2014

Re: Stream data from Kafka to HDFS

You may want to investigate using Flume to stream messages from Kafka to HDFS:

 

http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

http://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html

 

You can use a Kafka source if you need to modify the messages before they are delivered to HDFS, or the Kafka channel if no modification is needed.
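
As a rough sketch of the Kafka source option, an agent configuration could look something like the one below. The agent name tier1, the broker address, topic, consumer group and HDFS path are all placeholders, and the property names follow Flume 1.7 and later (older releases used zookeeperConnect, topic and groupId on the Kafka source), so please check them against the documentation for your version.

```properties
# Hypothetical agent "tier1": Kafka source -> memory channel -> HDFS sink
tier1.sources  = source1
tier1.channels = channel1
tier1.sinks    = sink1

# Kafka source (broker, topic and group id are placeholders)
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.kafka.bootstrap.servers = broker1:9092
tier1.sources.source1.kafka.topics = my_topic
tier1.sources.source1.kafka.consumer.group.id = flume
tier1.sources.source1.channels = channel1

# In-memory channel: fast, but events are lost if the agent dies
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

# HDFS sink: roll on size only (about 128 MB) to avoid many small files
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%Y-%m-%d
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.rollInterval = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.rollSize = 134217728
```

Rolling on size rather than on time is what keeps the file count down, at the cost of files staying open longer when traffic is low.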

 

-pd

New Contributor
Posts: 1
Registered: ‎12-28-2016

Re: Stream data from Kafka to HDFS

Hi

 

I am looking for exactly this. Small files are not a problem for me, as I have a daily cron job that merges the smaller files into one large file and then deletes the originals.
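
To illustrate the idea, here is a minimal sketch of that kind of merge-then-delete step. It is not my actual job: it works on a local directory standing in for HDFS, the function name compact_small_files and the part-* file pattern are made up for the example, and plain byte concatenation only makes sense for formats such as newline-terminated text.

```python
import shutil
import tempfile
from pathlib import Path


def compact_small_files(src_dir: Path, merged_path: Path, pattern: str = "part-*") -> int:
    """Concatenate files matching `pattern` in `src_dir` into `merged_path`,
    then delete the originals. Returns the number of files merged."""
    small_files = sorted(p for p in src_dir.glob(pattern) if p.is_file())
    if not small_files:
        return 0

    # Write under a temporary name first so readers never see a
    # half-written merged file.
    tmp_path = merged_path.with_name(merged_path.name + ".tmp")
    with tmp_path.open("wb") as out:
        for part in small_files:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    tmp_path.replace(merged_path)

    # Remove the small files only after the merged file is in place.
    for part in small_files:
        part.unlink()
    return len(small_files)


if __name__ == "__main__":
    # Tiny demo on a throwaway directory (a stand-in for an HDFS folder).
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "part-00000").write_text("record 1\n")
        (folder / "part-00001").write_text("record 2\n")
        merged = compact_small_files(folder, folder / "merged.txt")
        print(f"merged {merged} files")
```

On a real cluster the same steps would go through an HDFS client or the hdfs dfs command line instead of local file calls.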

 

I want to use Spark Streaming to pull records from my Kafka topic into HDFS. I assume you have already implemented this. Could you share some insights or references on how you achieved it?
