Created 07-27-2016 09:17 AM
I have a use case where I need to read files from a folder on Unix and write the data into the Hadoop file system (HDFS). Files will be generated in the folder by a downstream process in real time. Once a file has been generated, its data should be moved into Hadoop. I am using Apache Kafka for this process. How should I implement this use case?
Created 07-27-2016 09:34 AM
You really don't want to use Kafka for this. Kafka is built for small messages (the default maximum message size is about 1 MB), so it is a poor fit for whole files, and you'd be writing a lot of boilerplate code.
NiFi would provide a much simpler and better solution here. No file size limitations, and all the work has already been done for you.
You would need to use GetFile on the Unix folder (or ListFile -> FetchFile if you want to keep the files in the folder) and then PutHDFS. That will do everything you need. You can also use NiFi's MergeContent processor to batch up files if required. Writing fewer, larger files reduces NameNode memory pressure and makes downstream processing more efficient.
If you really feel you must use Kafka, you're going to have to jump through a lot of hoops and write a custom Producer and file handlers. You could then use something like a Storm topology with the storm-hdfs bolt to write out to HDFS, or write a Consumer by hand that uses the Hadoop APIs to write the file. Honestly, though, that's going to take you a lot of time compared with the simple NiFi solution.
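To give a rough idea of the file-handling work the Kafka route involves, here is a minimal Python sketch of just the "read a file and cut it into messages" part. Everything in it is an assumption for illustration: `send` is a hypothetical callable standing in for whatever producer client you use, `CHUNK_BYTES` is an arbitrary size picked to stay under the broker's message limit, and using the file name as the message key is one possible choice, not a Kafka requirement.

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple

# Assumption: keep each message well under the broker's size limit.
CHUNK_BYTES = 512 * 1024


def file_chunks(path: Path, chunk_bytes: int = CHUNK_BYTES) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, value) pairs for one file, in order.

    The key is the file name, so every chunk of a file shares one key.
    """
    key = path.name.encode("utf-8")
    with path.open("rb") as fh:
        while True:
            data = fh.read(chunk_bytes)
            if not data:
                break
            yield key, data


def publish_file(path: Path,
                 send: Callable[[bytes, bytes], None],
                 chunk_bytes: int = CHUNK_BYTES) -> int:
    """Pass every chunk of `path` to `send` (a hypothetical stand-in for a
    real producer's send call) and return the number of chunks sent."""
    count = 0
    for key, value in file_chunks(path, chunk_bytes):
        send(key, value)
        count += 1
    return count
```

Even this leaves out a lot: there is no end-of-file marker, no retry handling, and the consumer side would still have to reassemble the chunks and write them to HDFS.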
Created 07-27-2016 09:54 AM
How can I read only the incremental data using NiFi? Files will be created in the folder continuously, but once I have moved a file I don't want to move the same file again on the next run. Is that possible in NiFi?
Coming back to Kafka: if I write a custom producer, will it be possible to move only the incremental data, i.e. only the new files, given that file creation is a continuous process? Also, could you mention which class to look at if I want to write a custom producer that reads from a path instead of the standard input stream?
Created 07-27-2016 10:11 AM
If you write a producer of your own, you will also have to write a state mechanism for incremental processing, i.e. something that remembers which files have already been sent. Kafka will not help you in any way here.
If you use NiFi's ListFile processor, that will only process new files, or updated files since its last run, so will naturally give you incremental processing. Use ListFile -> FetchFile -> PutHDFS to get what you're after.
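For the custom-producer route, a state mechanism could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the names `FileState` and `run_once` are made up for this example, state is kept in a small JSON file (which must live outside the watched folder), a file counts as "new" if its name is unseen or its modification time has changed, and `handle` is a hypothetical callable that does the actual sending.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


class FileState:
    """Remembers which files were already processed (hypothetical helper)."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        self.seen: Dict[str, float] = {}
        if state_file.exists():
            # Reload state left behind by a previous run.
            self.seen = json.loads(state_file.read_text())

    def pending(self, folder: Path) -> List[Tuple[Path, float]]:
        """Return (path, mtime) for files that are new or have changed."""
        result = []
        for path in sorted(p for p in folder.iterdir() if p.is_file()):
            mtime = path.stat().st_mtime
            if self.seen.get(path.name) != mtime:
                result.append((path, mtime))
        return result

    def mark_done(self, path: Path, mtime: float) -> None:
        """Record a processed file and persist the state atomically."""
        self.seen[path.name] = mtime
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.seen))
        tmp.replace(self.state_file)


def run_once(folder: Path, state: FileState, handle: Callable[[Path], None]) -> int:
    """Process every pending file once; return how many were handled."""
    done = 0
    for path, mtime in state.pending(folder):
        handle(path)                 # e.g. send the file's contents
        state.mark_done(path, mtime) # only recorded after handle() succeeds
        done += 1
    return done
```

Calling `run_once` in a loop (or from a scheduler) gives incremental behaviour: a file is only marked as done after `handle` returns, so a crash mid-file means it is retried on the next run. The sketch does not deal with files that are still being written when the scan happens.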
Created 07-27-2016 11:45 AM
Could you explain what you mean by a state mechanism? Do I need to implement it in the producer, or before the data is read by the producer?
Created 10-28-2016 11:21 AM
Hi INDRANIL ROY,
Were you able to get the continuously streaming data (flat files) into Hadoop? Which ecosystem components did you use to get the real-time data into Hadoop? Please share the details of the components or the steps you followed to get the flat files into Hadoop.
