Created 06-22-2017 12:23 AM
Can Kafka process multiple files and then send them to Spark Streaming?
Created 06-22-2017 12:47 AM
Kafka is a message broker, so it only receives files/events from publishers and makes them available to consumers. It does not do any processing.
Spark Streaming determines how files/events are read. Because Spark Streaming uses micro-batching, it reads several files/events from Kafka and processes them together in one micro-batch.
I believe this will achieve what you are asking for. It will happen on the Spark side though, not in Kafka.
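To illustrate the micro-batching idea, here is a toy sketch in plain Python. It is not Spark code: `micro_batches`, `process` and the sample events are made-up names for illustration, and it groups events by count, whereas Spark Streaming groups whatever arrived during a batch interval.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream of events into micro-batches of at most batch_size (toy model)."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def process(batch: List[str]) -> int:
    """Hypothetical per-batch job: count the words across all events in the batch."""
    return sum(len(event.split()) for event in batch)


# Hypothetical events, e.g. lines from several files that were published to one topic.
events = ["a b", "c", "d e f", "g", "h i"]

# Each micro-batch is processed as a unit, regardless of which file an event came from.
results = [process(batch) for batch in micro_batches(events, batch_size=2)]
```

The point is only that the consumer side decides how many events are handled together. The broker just hands them over in order.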
As always, if you find this post helpful, don't forget to "accept" the answer.
Created 06-27-2017 03:25 AM
Meaning I should go straight to Spark to process multiple files?
Created 06-23-2017 06:47 AM
Hello @mel mendoza ,
Kafka is not a file-based system but an event-based one. If you want to process files with Spark Streaming via Kafka, you need a two-step approach: first ingest the data into Kafka, then consume the events from Kafka with Spark Streaming.
To ingest into Kafka you can, for example, use Kafka Connect with the file source (check /usr/hdp/current/kafka-broker/conf/connect-file-source.properties). It works like a "tail -f" on the configured file and streams any data appended to that file to the Kafka topic.
Afterwards you have to consume the events from that Kafka topic with your Spark Streaming job.
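To make the "tail -f" behaviour of the ingest step more concrete, here is a minimal plain-Python sketch. It is not Kafka Connect and uses no Kafka API: `forward_new_lines`, the `send` callback and the in-memory `topic` list are hypothetical stand-ins, and the sketch assumes newline-terminated UTF-8 text.

```python
import os
import tempfile
from typing import Callable, List


def forward_new_lines(path: str, offset: int, send: Callable[[str], None]) -> int:
    """Send every complete line appended to path since offset; return the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a partial line that is still being written
            offset = f.tell()  # remember the position after the last complete line
            send(line.decode("utf-8").rstrip("\n"))
    return offset


topic: List[str] = []  # in-memory stand-in for a Kafka topic

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")

    # First poll: the file already holds two complete lines.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("first\nsecond\n")
    offset = forward_new_lines(path, 0, topic.append)

    # Second poll: only the newly appended complete line is forwarded.
    with open(path, "a", encoding="utf-8", newline="\n") as f:
        f.write("third\npartial")
    offset = forward_new_lines(path, offset, topic.append)
```

Keeping the byte offset between polls is what lets the follower pick up only new data, so each line ends up in the topic once. The Spark Streaming job then reads from the topic, not from the file.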
HTH, Gerd
Created 06-27-2017 06:35 AM
Thanks @Gerd Koenig !
For processing multiple files in real time, what application/technology would you recommend?
Created 06-27-2017 07:12 AM
Hi @mel mendoza ,
maybe it is worth checking Flume to ingest multiple files into Kafka. Alternatively, you can use HDF (particularly NiFi) to do so.
Created 06-27-2017 08:17 AM
Thanks again! I'm currently using NiFi for data collection. I will try NiFi to Kafka.