Using Apache Kafka to read files from a folder in a UNIX file system and write into Hadoop file system

Rising Star

I have a use case where I need to read files from a folder on a Unix file system and write the data into the Hadoop file system (HDFS). Files are generated in the folder by a downstream process in real time, and once a file has been generated its data should be moved into Hadoop. I am using Apache Kafka for the process and need to know how to implement this use case:

  1. How do I read only the newly created files from the folder using a Kafka producer? (Any examples/Java classes to use? See the sketch below.)
  2. How do I write the consumer so that it writes the files into the Hadoop file system? (Any examples/Java classes to use?)
  3. Is there any other technology, such as NiFi or Apache Storm, that I need to use along with Kafka to achieve this, or can it be implemented entirely with Kafka?
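
For item 1, here is a rough, illustrative sketch of one possible approach: watch the folder with the standard java.nio WatchService and publish each newly created file to Kafka with the producer API. The broker address, the topic name file-ingest, and the folder path are assumptions for the example, not part of the original question, and a production version would also need to handle files that are still being written.

    // Sketch only: topic name, broker address, and folder path are illustrative assumptions.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.nio.file.*;
    import java.util.Properties;

    public class NewFileProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.ByteArraySerializer");

            Path watchedDir = Paths.get("/data/incoming");      // assumed source folder
            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
                 WatchService watcher = FileSystems.getDefault().newWatchService()) {

                // Register for create events only, so files that already exist are not re-sent.
                watchedDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

                while (true) {
                    WatchKey key = watcher.take();              // blocks until new events arrive
                    for (WatchEvent<?> event : key.pollEvents()) {
                        Path newFile = watchedDir.resolve((Path) event.context());
                        byte[] content = Files.readAllBytes(newFile);   // whole file as one message
                        producer.send(new ProducerRecord<>("file-ingest",
                                                           newFile.getFileName().toString(),
                                                           content));
                    }
                    key.reset();                                // required to keep receiving events
                }
            }
        }
    }

Note that this sends each file as a single Kafka message, which runs straight into the message-size concerns raised in the accepted answer below.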
1 ACCEPTED SOLUTION

Guru

You really don't want to use Kafka for this. Kafka is designed for small messages rather than whole files, so you would run into message-size limits and end up writing a lot of boilerplate code.

NiFi would provide a much simpler and better solution here. No file size limitations, and all the work has already been done for you.

You would need to use GetFile on the Unix folder (or ListFile -> FetchFile if you want to keep the files in the folder) and then PutHDFS. That will do everything you need. You can also use NiFi's MergeContent processor to batch up files if required, which helps with NameNode memory pressure and the efficiency of downstream processing.

If you really feel you must use Kafka, you're going to jump through a lot of hoops and write a custom producer and file handlers. You could then use something like a Storm topology with the storm-hdfs bolt to write out to HDFS, or write a manual consumer that uses the Hadoop APIs to write the files, but honestly, that's going to take you a lot of time compared with the simple NiFi solution.
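
To make the "manual consumer using the Hadoop APIs" option concrete, here is a minimal, illustrative sketch. It assumes the same hypothetical file-ingest topic as the producer sketch above, that each record carries a whole file keyed by its file name, and that the broker address, consumer group, and HDFS target directory are placeholders:

    // Sketch only: topic, group id, broker address, and HDFS path are illustrative assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class HdfsFileConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
            props.put("group.id", "hdfs-writer");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf);
                 KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {

                consumer.subscribe(Collections.singletonList("file-ingest"));
                while (true) {
                    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, byte[]> record : records) {
                        // One HDFS file per Kafka record; the target directory is a placeholder.
                        Path target = new Path("/landing/" + record.key());
                        try (FSDataOutputStream out = fs.create(target, true)) {
                            out.write(record.value());
                        }
                    }
                }
            }
        }
    }

Even in this stripped-down form you can see how much plumbing is involved compared with dropping a PutHDFS processor onto a NiFi canvas.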


5 REPLIES

Rising Star

How do I read only the incremental data using NiFi? I mean, the creation of files in the folder will be a continuous process, but once I have already moved a file I don't want to move the same file again the next time around. Will that be possible in NiFi?

Coming back to Kafka: if I write a custom producer, will it be possible to move only the incremental data, i.e. only the new files, given that file creation is a continuous process? Also, could you mention the class to look at if I want to write a custom producer that reads from a path instead of the standard input stream?

Guru

If you write a producer of your own, you will also have to write a state mechanism for incremental processing. Kafka will not in any way help you here.

If you use NiFi's ListFile processor, it will only process files that are new or updated since its last run, so it naturally gives you incremental processing. Use ListFile -> FetchFile -> PutHDFS to get what you're after.
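
For illustration, here is a minimal sketch of what such a state mechanism could look like if you did go the custom-producer route: it persists the names of files that have already been shipped to a local state file and skips them on the next scan. The class name and state-file location are made up for the example; it could be combined with a periodic directory scan, or with the WatchService producer sketch above, to avoid re-sending files after a restart.

    // Sketch only: the class name and the state-file location are illustrative.
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Stream;

    public class ProcessedFileState {
        private final Path stateFile;
        private final Set<String> processed = new HashSet<>();

        public ProcessedFileState(Path stateFile) throws IOException {
            this.stateFile = stateFile;
            if (Files.exists(stateFile)) {
                processed.addAll(Files.readAllLines(stateFile));   // reload state across restarts
            }
        }

        // Returns only the files in dir that have not been processed before.
        public Set<Path> newFiles(Path dir) throws IOException {
            Set<Path> result = new HashSet<>();
            try (Stream<Path> listing = Files.list(dir)) {
                listing.filter(p -> !processed.contains(p.getFileName().toString()))
                       .forEach(result::add);
            }
            return result;
        }

        // Records a file as processed and persists the state immediately.
        public void markProcessed(Path file) throws IOException {
            processed.add(file.getFileName().toString());
            Files.write(stateFile, (file.getFileName() + System.lineSeparator()).getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

The producer would call newFiles() on each scan, send each file, and then call markProcessed(), which is roughly the bookkeeping that ListFile's built-in state management handles for you automatically.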

Rising Star

Could you explain what you mean by a state mechanism? Do I need to implement that in the producer, or before the data is read by the producer?


@INDRANIL ROY

Hi INDRANIL ROY,

Were you able to get the continuously streaming data (flat files) into Hadoop? Which ecosystem components did you use to get the real-time data into Hadoop? Please provide the details of those components or the steps you followed to get the flat files into Hadoop.