Support Questions

Find answers, ask questions, and share your expertise

How to read data from HDFS and place into Kafka (don’t want to use Scala/Spark)? Any utilities or methods?

Explorer

We are looking for some kind of utility or tool to read the data from HDFS and place it in a Kafka topic. We appreciate your inputs.

 

From the community section, we came across this suggestion: "You could use Apache NiFi with a ListHDFS + FetchHDFS processor followed by PublishKafka". Can you provide more insight into how this can be achieved?

 

Thank you
Srinu

15 REPLIES

Expert Contributor

Explorer

Thank you.

 

We will work on this solution and get back. Is there a similar option to read data from Amazon S3 and put it into the topic?

 

 

Expert Contributor

Explorer

Same as the previous request: we are looking for a source connector here as well, to pull the data from Amazon S3 and put it in Kafka.

 

Explorer

We already have the data in HDFS and we want to pull it from HDFS and put it in a Kafka topic.

So, we are looking for a source connector here, to pull the data from HDFS and place it in Kafka.

Contributor

Hello @sriven ,

 

- As @Daming Xue mentioned, Kafka Connect is one of the good options; the doc https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/kafka-connect/kafka-connect.pdf shares an example of HDFS as a sink connector.

https://docs.cloudera.com/cdp-private-cloud-base/7.1.5/kafka-connect/topics/kafka-connect-connector-...

 

- Flume (CDH)  

https://docs.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html#concept_rsb_tyb_kv__sec...

 

- Nifi (https://blog.cloudera.com/adding-nifi-and-kafka-to-cloudera-data-platform/

https://community.cloudera.com/t5/Community-Articles/Integrating-Apache-NiFi-and-Apache-Kafka/ta-p/2... )

 

- Kafka-Hive Integration (https://docs.cloudera.com/cdp-private-cloud-base/7.1.5/integrating-hive-and-bi/topics/hive-kafka-int...)

 

- Custom Java app (https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/kafka-developing-applications/topics/kafka-de...)

 

- To try it out quickly (for testing purposes), you can use the console producer:

hadoop fs -cat file.txt | kafka-console-producer --broker-list <host:port>  --topic <topic>

https://docs.cloudera.com/cdp-private-cloud-base/7.1.5/kafka-managing/topics/kafka-manage-cli-produc...

 

- Spark (which you do not want)

https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/developing-spark-applications/topics/spark-us...

 

These are some options I could quickly think of; there must be many more.

 

Thanks & Regards,

Nandini

 

P.S. If you found this answer useful please upvote/accept.

 

 

SME || Kafka | Schema Registry | SMM | SRM

Explorer

@Nandinin ,

We have a requirement where Spark programs are writing files into HDFS.

So we want to read those files and send them to Kafka.

 

We know the HDFS sink connector is useful for writing to HDFS, and the HDFS source connector is useful only when the files were written by the HDFS sink connector.

So the HDFS source connector is also not the solution when the files are written by Spark programs.

Please let us know if there is any solution for this requirement.

 

 

 

Contributor

Hello,

 

What is the file format?

Why do you say the HDFS source connector is also not the solution if the files were written by a Spark program?


Spark - HDFS - Kafka is your entire flow, correct?
Spark to HDFS is already done, and now you are looking for HDFS - Kafka.

If you can help me understand the file format that Spark saves in, I can check whether the HDFS source connector can help your use case.

 

 


Explorer

@Nandinin ,

Yes, the flow is correct.

The files are in Parquet format.

 

 

Contributor

Hello @sriven 

 

Found this - https://community.cloudera.com/t5/Support-Questions/How-to-insert-parquet-file-to-Kafka-and-pass-the...
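If Scala/Spark is to be avoided entirely, one rough alternative is a small standalone Python producer. This is only a sketch under stated assumptions: it assumes the third-party libraries pyarrow (for Parquet and HDFS access) and kafka-python are installed and the Hadoop client config is available, and the path, broker address, and topic name below are hypothetical placeholders.

```python
import json

def row_to_message(row):
    """Serialize one Parquet row (a plain dict) into a JSON-encoded Kafka message."""
    return json.dumps(row, default=str).encode("utf-8")

def publish_parquet_dir(hdfs_dir, brokers, topic):
    """Read every Parquet file under hdfs_dir and publish each row to Kafka."""
    # Third-party imports are kept local so the helper above has no dependencies.
    import pyarrow.dataset as ds
    import pyarrow.fs as pafs
    from kafka import KafkaProducer

    fs = pafs.HadoopFileSystem(host="default")  # picks up HADOOP_CONF_DIR settings
    dataset = ds.dataset(hdfs_dir, format="parquet", filesystem=fs)
    producer = KafkaProducer(bootstrap_servers=brokers)
    for batch in dataset.to_batches():
        for row in batch.to_pylist():
            producer.send(topic, row_to_message(row))
    producer.flush()

# Hypothetical usage:
# publish_parquet_dir("/data/input", ["broker1:9092"], "my-topic")
```

This has none of Kafka Connect's offset tracking or delivery guarantees; it is only meant to show that Parquet-on-HDFS to Kafka is feasible without Spark.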

 

Please let me know if it helps.

 

Thanks & Regards,

Nandini


Explorer

Hello @Nandinin ,

We have gone through this already.

Anything without Scala/Spark ?

 

 

Contributor

Please try Kafka Connect then; that seems to be the best-suited option.
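For reference, a Kafka Connect connector is configured by posting a small JSON payload to the Connect REST API. The sketch below is purely illustrative: the connector class and all connector-specific properties are placeholders to be filled in from the Cloudera documentation linked earlier.

```
{
  "name": "hdfs-to-kafka-sketch",
  "config": {
    "connector.class": "<fully-qualified connector class from the docs>",
    "tasks.max": "1",
    "<connector-specific HDFS and topic properties>": "<see the linked documentation>"
  }
}
```

Posting this to the Connect REST endpoint (POST /connectors) creates the connector; the exact property names depend on which connector you choose.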


Explorer

How do we read Parquet files using Kafka Connect?

Put simply, we just want to read the Parquet files on HDFS using Kafka Connect, without Spark jobs.

Please let us know whether there is a solution.

 

 

 

 

Explorer

As you know,

 

We have a limitation with the Kafka source connector: it works only for HDFS objects/files created by the HDFS 2 Sink Connector for Confluent Platform.

So how can we pull files created by Spark, MapReduce, or any other jobs on HDFS?

 

The use case of the HDFS source connector is only to mirror the same data back onto Kafka.

 

 

Contributor

Please try NiFi - Kafka.

 

https://community.cloudera.com/t5/Community-Articles/Apache-NiFi-1-10-Support-for-Parquet-RecordRead...
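As a rough outline of that NiFi flow (the ListHDFS, FetchHDFS, and record-based PublishKafka processors and the ParquetReader exist in NiFi 1.10+; the directory, broker, and topic values here are hypothetical placeholders):

```
ListHDFS            -> emits one FlowFile per new file under Directory = /data/input
FetchHDFS           -> HDFS Filename = ${path}/${filename}  (pulls the file content)
PublishKafkaRecord  -> Record Reader  = ParquetReader
                       Record Writer  = JsonRecordSetWriter
                       Kafka Brokers  = broker1:9092
                       Topic Name     = my-topic
```

With a record-based publisher, each Parquet row becomes one Kafka message, which avoids Spark entirely.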
