Ways to get data from Kafka to HDFS


I'm looking for ways to get data from Kafka to HDFS.

Currently I'm using the pipeline below. Has anyone faced issues using Flume this way?

 Flume(exec-source and Kafka-sink) --> Kafka --> Flume(kafka-source and HDFS-sink)

Other options: if I already have a Kafka consumer written, is there a Python way of getting the data from that consumer to HDFS (other than Confluent's Connect API)?

Or are there any other means of getting the data from Kafka to HDFS?


Expert Contributor

Hi Swaapnika, I've tried using Flume for that and had no issues.

Take a look at this repository for Python. It is the most exhaustive one, I guess.


I see that Flume is deprecated and will be removed from HDP in future releases, as mentioned in the HDP-2.6.2 Release Notes. Are there any other techniques that could be used with Kafka to get data into HDFS?

Super Collaborator

@Swaapnika Guntaka You could use Spark Streaming in PySpark to consume a topic and write the data to HDFS.
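To make that concrete, here is a minimal sketch. It uses Structured Streaming (the newer DataFrame-based API) rather than the older DStream-based Spark Streaming, and it assumes the job is submitted with the `spark-sql-kafka-0-10` package on the classpath. The broker address, topic name, and HDFS paths are placeholders I made up, and the helper names are hypothetical.

```python
def kafka_source_options(brokers, topic):
    """Options for Spark's built-in Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "startingOffsets": "earliest",  # assumption: read the topic from the beginning
    }


def start_kafka_to_hdfs(spark, brokers, topic, out_path, checkpoint_path):
    """Start a streaming query that appends each Kafka record's value as a text line."""
    records = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(brokers, topic))
        .load()
    )
    # The text sink expects exactly one string column.
    lines = records.selectExpr("CAST(value AS STRING) AS value")
    return (
        lines.writeStream.format("text")
        .option("path", out_path)
        .option("checkpointLocation", checkpoint_path)  # needed for restart and recovery
        .outputMode("append")
        .start()
    )


def main():
    # Imported here so the helpers above can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()
    query = start_kafka_to_hdfs(
        spark,
        brokers="broker1:9092",                      # placeholder
        topic="events",                              # placeholder
        out_path="hdfs:///data/events",              # placeholder
        checkpoint_path="hdfs:///checkpoints/events",  # placeholder
    )
    query.awaitTermination()
```

Call `main()` from a script run with `spark-submit`. The checkpoint directory is what lets the query resume from its last processed offsets after a restart, so it should live on HDFS as well.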

You could also use HDF with NiFi and skip Python entirely.

Also, this is a Python client, by Confluent, not related to Kafka Connect.
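If you go the plain-consumer route, a rough sketch could look like the one below. It assumes the `confluent-kafka` package is installed and that the `hdfs` command-line tool is on the PATH of the machine running the script. The batching logic, function names, broker, group id, topic, and output paths are all my own placeholders, not something from the client library. Offsets are committed only after a batch has been written, so a crash may re-deliver messages but should not drop them.

```python
import subprocess
import time


def put_to_hdfs(data, hdfs_path):
    """Stream bytes into a new HDFS file with `hdfs dfs -put - <dst>` (reads stdin)."""
    subprocess.run(["hdfs", "dfs", "-put", "-", hdfs_path], input=data, check=True)


def consume_in_batches(consumer, write_batch, batch_size=1000, max_empty_polls=5):
    """Poll the consumer, write newline-joined batches, and commit after each write.

    Stops after `max_empty_polls` consecutive empty polls and returns the
    number of messages written.
    """
    batch, empty_polls, written = [], 0, 0

    def flush():
        nonlocal batch, written
        write_batch(b"\n".join(batch) + b"\n")
        consumer.commit(asynchronous=False)  # commit only after a successful write
        written += len(batch)
        batch = []

    while empty_polls < max_empty_polls:
        msg = consumer.poll(1.0)
        if msg is None:
            empty_polls += 1
            continue
        if msg.error():
            # Simplification: treat any consumer error as fatal.
            raise RuntimeError(str(msg.error()))
        empty_polls = 0
        batch.append(msg.value())
        if len(batch) >= batch_size:
            flush()
    if batch:
        flush()
    return written


def main():
    # Imported here so the helpers above can be loaded without the client installed.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "broker1:9092",  # placeholder
        "group.id": "hdfs-writer",            # placeholder
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,          # we commit manually after each write
    })
    consumer.subscribe(["events"])            # placeholder topic
    try:
        consume_in_batches(
            consumer,
            # One new file per batch, named by the current time in milliseconds.
            lambda data: put_to_hdfs(data, f"/data/events/part-{int(time.time() * 1000)}.txt"),
        )
    finally:
        consumer.close()
```

Writing one file per batch keeps the sketch simple, but small batches will produce many small HDFS files, so pick a batch size with that in mind.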


Is there a difference between the Kafka connector in the Python module and Confluent's? This is the GitHub link for the one mentioned in the Python module:

Super Collaborator

Confluent is the company founded by Kafka's original creators, and it provides commercial support for Kafka. I personally would trust their code more than someone else's.