10-05-2016 04:27 PM
I am very new to Big Data and have a question about how Flume and Kafka work with files.
We have a number of systems producing small CSV files, anywhere between 5-100 MB, on a nightly basis. We are looking at loading these into HDFS via Flafka. The plan is to use a Spooling Directory source to monitor an NFS directory where the files land; the events would then be written to a Kafka channel, and an HDFS sink would write them out to HDFS. Our idea is to create one Kafka topic per system providing the CSV files.
I'm trying to understand how these CSV files will be stored on the topic; the Kafka documentation talks about the concept of messages.
When using the Spooling Directory source, is a message a complete source file, or is a message a single row of that file?
The reason I ask is that I am trying to understand what roll settings I should be using.
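For context, this is roughly the agent definition I have in mind. All names here (agent, directories, brokers, topic) are placeholders of my own, and the Kafka channel properties follow the Flume 1.7 naming:

```
# One agent per source system; names are illustrative only
a1.sources = csvSrc
a1.channels = kafkaCh
a1.sinks = hdfsSink

# Spooling Directory source watching the NFS landing directory
a1.sources.csvSrc.type = spooldir
a1.sources.csvSrc.spoolDir = /mnt/nfs/landing/system1
a1.sources.csvSrc.channels = kafkaCh

# Kafka channel: one topic per source system
a1.channels.kafkaCh.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafkaCh.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.channels.kafkaCh.kafka.topic = system1_csv

# HDFS sink draining the channel
a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.channel = kafkaCh
a1.sinks.hdfsSink.hdfs.path = /data/landing/system1/%Y-%m-%d
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
```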
10-05-2016 04:45 PM
10-05-2016 04:51 PM - edited 10-05-2016 04:52 PM
Flume is event-based. With the Spooling Directory source and its default LINE deserializer, each line of the source file becomes one Flume event, so a single row of your CSV will be one Kafka message, not the whole file.
EDIT: What Harsh Said. He's never wrong.
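To make that concrete: the spooldir source's deserializer property defaults to LINE, which splits each file into one event per line. Reusing the placeholder names from the config above (the max-length value is just an example):

```
# Default behaviour: one event per line of the spooled file
a1.sources.csvSrc.deserializer = LINE
# Lines longer than this many characters are truncated (default 2048)
a1.sources.csvSrc.deserializer.maxLineLength = 4096
```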
10-05-2016 06:08 PM
Thanks for the great response. That makes it very clear.
I am assuming, then, that when the HDFS sink reads from the Kafka topic, files on HDFS will be created and rolled based on the roll interval/size/count settings I use, not on the boundaries of the original source file, unless of course I use the BlobDeserializer so that each whole file becomes a single event.
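In other words, something like this on the sink side; the values are illustrative, and setting a roll property to 0 disables that trigger:

```
# Roll HDFS files by time, size, or event count, whichever fires first.
# rollInterval is in seconds, rollSize in bytes; 0 disables a trigger.
a1.sinks.hdfsSink.hdfs.rollInterval = 300
a1.sinks.hdfsSink.hdfs.rollSize = 134217728
a1.sinks.hdfsSink.hdfs.rollCount = 0

# Alternative: emit each whole file as a single event instead of one per line
# (BlobDeserializer ships with the morphline/Solr sink module)
a1.sources.csvSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
```

One caveat with the BlobDeserializer route: a 100 MB file as a single event is far above Kafka's default maximum message size (about 1 MB), so per-line events look like the safer option for our file sizes.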