Ways to get data from Kafka to HDFS

Explorer

I'm looking for ways to get data from Kafka to HDFS.

Currently I'm using the pipeline below. Has anyone run into issues using Flume this way?

 Flume(exec-source and Kafka-sink) --> Kafka --> Flume(kafka-source and HDFS-sink)

Other options: if I already have a Kafka consumer written, is there a Python way of getting the data from that consumer into HDFS (other than Confluent's Connect API)?

Or are there any other ways to get the data from Kafka to HDFS?

5 REPLIES

Super Collaborator

Hi Swaapnika, I've tried using Flume for that and had no issues.

For Python, take a look at https://github.com/edenhill/librdkafka. It is a C/C++ Kafka client library that Python clients such as confluent-kafka-python are built on, and I think it is the most complete one.

Explorer

I see that Flume is deprecated and will be removed from HDP in a future release, as mentioned in the HDP-2.6.2 Release Notes. Are there any other techniques that could be used with Kafka to get data into HDFS?

Super Collaborator

@Swaapnika Guntaka You could use Spark Streaming in PySpark to consume a topic and write the data to HDFS.
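A minimal sketch of the PySpark route, assuming Spark's Structured Streaming API (rather than the older DStream API) and that the `spark-sql-kafka` connector package is available to the job. The broker address, topic name, and HDFS paths are hypothetical placeholders, and `kafka_source_options` and `stream_topic_to_hdfs` are illustrative helper names, not part of any library.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source.

    The keys are Spark's documented Kafka source options; the values
    passed in by the caller are assumptions for illustration.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def stream_topic_to_hdfs(bootstrap_servers, topic, output_path, checkpoint_path):
    """Continuously copy message values from a Kafka topic to text files on HDFS."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    records = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka values arrive as binary; the text sink needs one string column.
        .selectExpr("CAST(value AS STRING) AS value")
    )

    query = (
        records.writeStream.format("text")
        .option("path", output_path)
        # The checkpoint directory lets Spark resume from the last processed offsets.
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
    query.awaitTermination()


# Hypothetical invocation (placeholder broker, topic, and paths):
# stream_topic_to_hdfs("broker1:9092", "events",
#                      "hdfs:///data/kafka/events",
#                      "hdfs:///checkpoints/kafka-events")
```

The job would typically be submitted with `spark-submit` and the Kafka connector added through `--packages`; the exact connector coordinates depend on the Spark and Scala versions in the cluster.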

You could also use HDF with NiFi and skip Python entirely.

Also, Confluent publishes a Python client that is not related to Kafka Connect: https://github.com/confluentinc/confluent-kafka-python
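For the plain-Python route, a consumer loop could batch messages and write each batch to HDFS as a file. The sketch below assumes the `confluent-kafka` package for consuming and the third-party `hdfs` package (HdfsCLI, a WebHDFS client) for writing; the thread does not name an HDFS client, so that choice is an assumption. The broker address, NameNode URL, user, topic, group id, and paths are all hypothetical, and the helper names are illustrative.

```python
import time


def consume_payloads(consumer, max_idle_polls=5, timeout=1.0):
    """Yield message values until the consumer has been idle for several polls."""
    idle = 0
    while idle < max_idle_polls:
        msg = consumer.poll(timeout)
        if msg is None:          # nothing arrived within the timeout
            idle += 1
            continue
        if msg.error():
            raise RuntimeError(str(msg.error()))
        idle = 0
        if msg.value() is not None:   # skip messages without a payload
            yield msg.value()


def batch_messages(payloads, max_records=1000):
    """Group payloads into lists of at most max_records items."""
    batch = []
    for payload in payloads:
        batch.append(payload)
        if len(batch) >= max_records:
            yield batch
            batch = []
    if batch:                    # flush the final partial batch
        yield batch


def batch_path(base_dir, topic, run_id, seq):
    """Build a unique HDFS file path for one batch."""
    return f"{base_dir}/{topic}/run-{run_id}/part-{seq:06d}.txt"


def run(topic="events", bootstrap="localhost:9092",
        namenode_url="http://namenode:50070", base_dir="/data/kafka"):
    # Lazy imports: both packages are third-party (assumed installed).
    from confluent_kafka import Consumer
    from hdfs import InsecureClient

    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": "kafka-to-hdfs",      # hypothetical consumer group
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,      # commit only after a successful write
    })
    client = InsecureClient(namenode_url, user="hdfs")
    run_id = int(time.time())
    consumer.subscribe([topic])
    try:
        batches = batch_messages(consume_payloads(consumer))
        for seq, batch in enumerate(batches):
            path = batch_path(base_dir, topic, run_id, seq)
            client.write(path, data=b"\n".join(batch) + b"\n")
            consumer.commit(asynchronous=False)
    finally:
        consumer.close()
```

Committing offsets only after the file is written gives at-least-once delivery: a crash between the write and the commit can produce duplicate records on restart, so downstream readers may need to deduplicate.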

Explorer

Is there a difference between the Kafka connector in the Python module and Confluent's one? This is the GitHub link for the one mentioned in the Python module:

Super Collaborator

Confluent is the company founded by Kafka's original creators, and it provides commercial support for Kafka. Personally, I would trust their code more than someone else's.