05-11-2016 09:09 AM
Looking for some advice on the best way to store streaming data from Kafka into HDFS. I am currently using Spark Streaming with 30-minute batch intervals, which creates lots of small files. I have tried Hive and its compaction jobs, but it looks like compaction isn't supported when writing from Spark yet.
Any advice would be greatly appreciated.
05-11-2016 11:56 AM
You may want to investigate using Flume to stream messages from Kafka to HDFS.
You can use a Kafka source if you need to modify the messages before they are delivered to HDFS, or you can use the Kafka channel if no modification is needed.
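As a sketch of the Kafka-channel approach, a Flume agent config might look like the following. The agent name, broker address, topic, and HDFS paths are placeholders; the roll settings shown are one way to avoid producing many small files by rolling on size/time rather than event count.

```properties
# Hypothetical Flume agent: Kafka channel feeding an HDFS sink directly
# (no source needed -- the channel itself consumes from the topic)
agent.channels = kc
agent.sinks = hdfs-sink

agent.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kc.kafka.bootstrap.servers = broker1:9092
agent.channels.kc.kafka.topic = my-topic
agent.channels.kc.parseAsFlumeEvent = false

agent.sinks.hdfs-sink.channel = kc
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = /data/kafka/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Roll on time or size, never on event count, to keep files large
agent.sinks.hdfs-sink.hdfs.rollInterval = 3600
agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent.sinks.hdfs-sink.hdfs.rollCount = 0
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```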
07-04-2017 08:56 PM
I am looking for something exactly like this. Small-file creation is not a problem for me, since I have a daily cron job that merges the smaller files into a large one and then deletes the small ones.
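A daily merge job like the one described above can be sketched as follows. This is a hypothetical, simplified version that uses the local filesystem to illustrate the logic; a real job would operate on HDFS instead, e.g. by shelling out to `hdfs dfs -getmerge` or using an HDFS client library. The function name and file pattern are my own placeholders.

```python
# Hypothetical sketch of a daily merge job: concatenate each small
# file into one large file, then delete the originals. Uses the local
# filesystem as a stand-in for HDFS to keep the example self-contained.
import glob
import os


def merge_small_files(src_dir, dest_path, pattern="part-*"):
    """Append every matching small file to dest_path, then delete it.

    Returns the number of small files that were merged.
    """
    small_files = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(dest_path, "ab") as dest:
        for path in small_files:
            with open(path, "rb") as src:
                dest.write(src.read())
            os.remove(path)  # drop the small file once it is merged
    return len(small_files)
```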
I am looking to use Spark Streaming to pull records from my Kafka topic into HDFS. I guess you have already implemented this. Can you share some insights or references on how you achieved it?
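For what it's worth, a minimal sketch of a Kafka-to-HDFS pipeline in PySpark (Structured Streaming, Spark 2.x with the spark-sql-kafka package on the classpath) might look like the following. The broker address, topic name, output path, and checkpoint path are all placeholders, and this is an untested illustration of the approach rather than a known-working job; it cannot run without a Spark cluster and a Kafka broker.

```python
# Hypothetical sketch: consume a Kafka topic and append the message
# values to HDFS as text files in 30-minute micro-batches.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

# Kafka source: each row carries key/value as binary columns,
# so cast the value to a string before writing it out
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "my-topic")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

# Sink: one micro-batch every 30 minutes, checkpointed so the
# query can recover its Kafka offsets after a restart
query = (events.writeStream
         .format("text")
         .option("path", "hdfs:///data/kafka/my-topic")
         .option("checkpointLocation", "hdfs:///checkpoints/my-topic")
         .trigger(processingTime="30 minutes")
         .start())

query.awaitTermination()
```

Note that a 30-minute trigger still produces one or more files per batch per partition, so a downstream compaction or merge step (like the cron job above) is usually still needed.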