Support Questions
Find answers, ask questions, and share your expertise

Flafka question

New Contributor



I am very new to Big Data and have a question about how Flume and Kafka work with files.


We have a number of systems producing small CSV files, anywhere between 5-100 MB, on a nightly basis. We were looking at loading these into HDFS via Flafka. I have looked at using the spooling directory source to monitor an NFS share where the files will land; these would then be written to a Kafka channel, and an HDFS sink would write them to HDFS. Our idea was to create one Kafka topic per system providing the CSV files.

I'm trying to understand how these CSV files will be stored on the topic; the Kafka documentation talks about the concept of messages.

When using the spooling directory source, is a message a complete source file, or is a message a single row of that file?

The reason I ask is that I am trying to understand which roll settings I should be using.
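For context, a pipeline like the one described can be sketched as a single Flume agent configuration. This is a minimal sketch; the agent name (a1), directory paths, topic name, and broker addresses below are placeholders, not details from the thread:

```properties
a1.sources = src1
a1.channels = kc1
a1.sinks = hdfs1

# Spooling directory source watching the NFS landing directory
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /mnt/nfs/landing/systemA
a1.sources.src1.channels = kc1

# Kafka channel -- one topic per producing system
a1.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.channels.kc1.kafka.topic = systemA-csv

# HDFS sink draining the Kafka channel
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = kc1
a1.sinks.hdfs1.hdfs.path = /data/landing/systemA
a1.sinks.hdfs1.hdfs.fileType = DataStream
```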





Re: Flafka question

Master Guru
Flume's SpoolingDirectorySource [1] by default runs with a "LINE" deserialiser [2], which is applied in its file event reader [3]. This means the source opens each file, reads individual line-based rows out of it, and forms them into individual Flume events. The channel the source writes to will therefore receive independent rows (lines) in the Flume event structure.
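In configuration terms, this default behaviour corresponds to the following (a minimal sketch; a1/src1 are placeholder names):

```properties
# LINE is the default: each line of a spooled file becomes one Flume event
a1.sources.src1.deserializer = LINE
# Characters to read per line before truncating (default 2048)
a1.sources.src1.deserializer.maxLineLength = 2048
```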

You do have control over this - you can specify a different, custom-written deserialiser class to make it behave differently, e.g. read the entire file as-is into a single event.

Flume does offer one such inbuilt whole-file reading deserialiser called the 'BlobDeserializer', documented at [4], for use in conjunction with a spooling directory source.
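Enabling it on a spooling directory source looks roughly like this (a sketch with placeholder agent/source names; the builder class lives in Flume's morphline module):

```properties
# Read each spooled file whole into a single event
a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Maximum bytes buffered per blob (default 100,000,000); larger files are split
a1.sources.src1.deserializer.maxBlobLength = 100000000
```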

[1] -
[2] -
[3] -
[4] -

Re: Flafka question

New Contributor

Thanks for the great response - makes it very clear.


I am assuming now that when I read the Kafka topic with the HDFS sink, files on HDFS will be created/written based on the roll interval/size that I use, and not on the original file boundaries - unless of course I use the BlobDeserializer.

Re: Flafka question

Master Guru
That's right.

It is also worth keeping in mind that once something gets into Flume from a source, in whatever size or form, it is represented as an 'event' carrying some size X, not as a whole 'file'. This is true of blob-based deserialisers too.

The sink's roll size setting is agnostic of what an event carries, and is only concerned with 'how many bytes have I written so far into my open file?' before every event write. So even with large blob-based events your roll size settings will be honoured to some degree.

For example, if you configure your roll size as 10 MB but all events come in at ~100 MB, you will naturally observe one file (of ~100 MB) per event, because the roll size check only realises it has exceeded its limit _after_ the first write is done.
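The roll behaviour described above is driven by the HDFS sink's roll properties; a minimal sketch with placeholder names:

```properties
# Roll a new file once ~10 MB have been written (checked after each event write)
a1.sinks.hdfs1.hdfs.rollSize = 10485760
# Disable time- and count-based rolling so only size triggers a roll
a1.sinks.hdfs1.hdfs.rollInterval = 0
a1.sinks.hdfs1.hdfs.rollCount = 0
```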

Re: Flafka question

Cloudera Employee

Flume is based on events. So a single line will be considered a message.


EDIT: What Harsh Said. He's never wrong.
