05-11-2016 09:09 AM
Looking for some advice on the best way to store streaming data from Kafka in HDFS. We currently use Spark Streaming with 30-minute intervals, which creates lots of small files. I attempted to use Hive and its compaction jobs, but it looks like this isn't supported when writing from Spark yet.
Any advice would be greatly appreciated.
05-11-2016 11:56 AM
You may want to investigate using Flume to stream messages from Kafka to HDFS:
You can use a Kafka source if you need to modify the messages before they are delivered to HDFS, or you can use the Kafka channel if no modification is needed.
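For the no-modification case, a minimal Flume agent sketch might look like the following. The broker address, topic name, and HDFS path are placeholders; the roll settings are what control file size, and hence the small-file problem:

```
# Kafka channel feeding an HDFS sink directly (no source needed
# when messages go straight from Kafka to HDFS unmodified).
a1.channels = kc
a1.sinks = hdfsSink

a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.kafka.bootstrap.servers = broker1:9092
a1.channels.kc.kafka.topic = my_topic
a1.channels.kc.parseAsFlumeEvent = false

a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.channel = kc
a1.sinks.hdfsSink.hdfs.path = /data/events/%Y-%m-%d
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# Roll on size/time rather than event count, so files stay large:
a1.sinks.hdfsSink.hdfs.rollInterval = 3600
a1.sinks.hdfsSink.hdfs.rollSize = 134217728
a1.sinks.hdfsSink.hdfs.rollCount = 0
```

With `rollCount = 0` and a 128 MB `rollSize`, the sink keeps appending to one file until the size or the hourly interval is hit, instead of producing a new file per micro-batch.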
07-04-2017 08:56 PM
I am looking for something exactly like this. Small file creation is not a problem for me, since I have a cron job that merges the smaller files into a larger one daily before deleting the smaller ones.
I am looking for Spark Streaming to pull records from my Kafka topic into HDFS. I guess you have already implemented this. Can you provide me some insights or references as to how you achieved it?
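As an aside, the daily merge described above can also be done as a small Spark batch job instead of a cron script. This is only a sketch with hypothetical paths; `coalesce` bounds the number of output files:

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-compaction").getOrCreate()

    // Hypothetical paths: small files written by the streaming job for one day
    val inputDir  = "/data/raw/2017-07-04"
    val outputDir = "/data/compacted/2017-07-04"

    // Read all the small files and rewrite them as a few large ones;
    // coalesce(4) caps the output at 4 files.
    spark.read.text(inputDir)
      .coalesce(4)
      .write.text(outputDir)

    // Only after verifying the output should the small files be deleted,
    // e.g. hdfs dfs -rm -r /data/raw/2017-07-04
    spark.stop()
  }
}
```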
11-23-2018 07:43 AM
Flume is likely your best option, as Patrick has pointed out. Refer to his link for further details on this.
Should you wish to use Spark Streaming, that's also possible. To integrate Kafka and Spark you can follow the guide below:
Then you can write a DStream output to save into HDFS:
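As a sketch of that approach (broker, topic, consumer group, and output path are placeholders), a direct Kafka stream saved to HDFS with the spark-streaming-kafka-0-10 integration looks roughly like:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc  = new StreamingContext(conf, Seconds(60)) // 1-minute batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",          // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hdfs-writer",           // placeholder
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("my_topic"), kafkaParams))

    // Each batch interval becomes a directory of text files under the
    // prefix, suffixed with the batch timestamp. Note this is exactly
    // how the small-file problem arises: one directory per batch.
    stream.map(_.value).saveAsTextFiles("hdfs:///data/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```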
11-25-2018 08:12 PM
What about:
Kafka => HBase => HDFS
or if you have Kudu, Kafka => Kudu => HDFS
So for real-time (near-real-time) data, store and access it in HBase/Kudu.
Later, you can move the data from HBase/Kudu to HDFS on an N-hourly or daily basis to avoid the small-file problem.
FYI, Hive can sync with HBase.
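For reference, the Hive-HBase mapping plus the periodic move could be sketched like this in HiveQL. The table names, column family, and partition value are hypothetical, and `events_hdfs` is assumed to be an existing HDFS-backed partitioned table:

```sql
-- Map an existing HBase table ("events") into Hive via the storage handler.
-- "d:payload" is a hypothetical column family:qualifier.
CREATE EXTERNAL TABLE events_hbase (rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:payload")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Daily move into an HDFS-backed table; a single INSERT ... SELECT
-- writes a few large files instead of many small ones.
INSERT INTO TABLE events_hdfs PARTITION (dt = '2018-11-25')
SELECT rowkey, payload FROM events_hbase;
```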