Can Kafka process multiple files?

Contributor

Can Kafka process multiple files and then send them to Spark Streaming?

1 ACCEPTED SOLUTION

@mel mendoza

Kafka is a message broker, so it only receives events (for example, file contents sent as messages) from publishers and makes them available to consumers. The broker itself does not do any processing.

Spark Streaming dictates how the events are read. Because Spark Streaming works in micro-batches, it reads several events from Kafka and processes them together in each micro-batch.

I believe this will achieve what you are asking for, but it happens on the Spark side, not in Kafka.
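To make the Spark side concrete, here is a minimal PySpark sketch of reading a Kafka topic in micro-batches. It uses Structured Streaming's built-in `kafka` source rather than the older DStream API, which is a substitution on my part. The function names, the 10-second interval, and the `startingOffsets` choice are assumptions for illustration, and `spark` is an existing `SparkSession` started with the `spark-sql-kafka-0-10` package available.

```python
def kafka_source_options(bootstrap_servers, topic, max_offsets_per_trigger=None):
    """Build the option map for Spark's built-in "kafka" streaming source."""
    options = {
        "kafka.bootstrap.servers": bootstrap_servers,  # broker list, host:port
        "subscribe": topic,                            # topic to read from
        "startingOffsets": "earliest",                 # assumption: read from the start
    }
    if max_offsets_per_trigger is not None:
        # Caps how many records a single micro-batch may pull from Kafka.
        options["maxOffsetsPerTrigger"] = str(max_offsets_per_trigger)
    return options


def start_micro_batches(spark, options, interval="10 seconds"):
    """Start a streaming query that handles the Kafka records batch by batch."""
    records = (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        # Kafka values arrive as bytes; cast them to text lines.
        .selectExpr("CAST(value AS STRING) AS line")
    )

    def handle_batch(batch_df, batch_id):
        # Each call receives every record that arrived during one interval.
        print(f"batch {batch_id}: {batch_df.count()} records")

    return (
        records.writeStream
        .foreachBatch(handle_batch)
        .trigger(processingTime=interval)
        .start()
    )
```

A call might look like `start_micro_batches(spark, kafka_source_options("localhost:9092", "file-lines")).awaitTermination()`, where the broker address and the topic name `file-lines` are placeholders.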

As always, if you find this post helpful, don't forget to "accept" the answer.

6 REPLIES

Contributor

@Eyad Garelnabi

Meaning I should go straight to Spark to process multiple files?

Guru

Hello @mel mendoza ,

Kafka is basically not a file-based system but an event-based one. If you want to process files with Spark Streaming via Kafka, you need a two-step approach: first ingest the data into Kafka, then consume the events from Kafka with Spark Streaming.

To ingest into Kafka you can, for example, use Kafka Connect with the file source (check /usr/hdp/current/kafka-broker/conf/connect-file-source.properties). It works like a "tail -f" on that file and streams any incoming data from that file to the Kafka topic.

Afterwards, you consume the events from that Kafka topic with your Spark Streaming job.
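If you would rather script the ingest step for several files than configure Kafka Connect, a rough Python sketch could look like the one below. This is not the Kafka Connect file source: it reads each file once instead of tailing it. The function name `publish_files` and the choice to key each record by file name are my own assumptions; `producer` is expected to expose a kafka-python style `send(topic, key=..., value=...)` method.

```python
from pathlib import Path


def publish_files(producer, topic, paths):
    """Send every non-empty line of each file to `topic`, keyed by file name.

    Returns the number of records handed to the producer.
    """
    sent = 0
    for path in map(Path, paths):
        # Using the file name as the key keeps one file's lines in one partition.
        key = path.name.encode("utf-8")
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                # Without configured serializers, key and value must be bytes.
                producer.send(topic, key=key, value=line.encode("utf-8"))
                sent += 1
    return sent
```

With kafka-python installed, usage would be along the lines of `producer = KafkaProducer(bootstrap_servers="localhost:9092")`, then `publish_files(producer, "file-lines", ["a.log", "b.log"])` followed by `producer.flush()`. The broker address, topic, and file names are placeholders.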

HTH, Gerd

Contributor

Thanks @Gerd Koenig !

For processing multiple files in real time, what application/technology would you recommend?

Guru

Hi @mel mendoza ,

It may be worth checking out Flume to ingest multiple files into Kafka. Alternatively, you can use HDF (particularly NiFi) to do so.

Contributor

Thanks again! I'm currently using NiFi for data collection. I will try NiFi to Kafka.