New Contributor

Flafka question

 Hi, 

 

I am very new to big data and have a question about how Flume and Kafka work with files.

 

We have a number of systems producing small CSV files, anywhere between 5-100 MB, on a nightly basis. We were looking at loading these into HDFS via Flafka. I have looked at using a SpoolingDir source to monitor an NFS directory where the files will land; these would then be written to a Kafka channel, and an HDFS sink would write them to HDFS. Our idea was to create a Kafka topic per system providing the CSV files.
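Roughly, the agent I had in mind for each source system looks something like the sketch below (the names, paths, topic and broker addresses are just placeholders, and some Kafka channel property names may differ slightly depending on the Flume/CDH version):

# Hypothetical agent: spooling directory source -> Kafka channel -> HDFS sink
agent1.sources  = spool-src
agent1.channels = kafka-ch
agent1.sinks    = hdfs-sink

# Watch the NFS landing directory for completed CSV files
agent1.sources.spool-src.type     = spooldir
agent1.sources.spool-src.spoolDir = /mnt/nfs/landing/systemA
agent1.sources.spool-src.channels = kafka-ch

# Kafka channel, one topic per source system
agent1.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kafka-ch.kafka.bootstrap.servers = broker1:9092,broker2:9092
agent1.channels.kafka-ch.kafka.topic = systemA-csv

# HDFS sink draining the same channel
agent1.sinks.hdfs-sink.type      = hdfs
agent1.sinks.hdfs-sink.channel   = kafka-ch
agent1.sinks.hdfs-sink.hdfs.path = /data/landing/systemA
agent1.sinks.hdfs-sink.hdfs.fileType    = DataStream
agent1.sinks.hdfs-sink.hdfs.writeFormat = Text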

I'm trying to understand how these CSV files will be stored on the topic; the Kafka documentation talks about the concept of messages.

When using SpoolingDir, is a message considered a complete source file, or is a message a single row in the source file?

The reason I ask is that I am trying to understand what roll settings I should be using.

 

Thanks,

Joe


Re: Flafka question

Flume's SpoolingDirectorySource [1] by default runs with a "LINE" deserialiser [2], which is applied in its file event reader [3]. This means the source opens each file and reads individual line-based rows out of it, forming each row into an individual Flume event. Thereby, the channel the source writes to will receive independent rows (lines) in the Flume event structure.

You do have control over this - you can specify a different, custom-written deserialiser class to have it behave differently, e.g. read the entire file as-is into a single event, etc.

Flume does offer one such inbuilt whole-file reading deserialiser called the 'BlobDeserializer', documented at [4], for use in conjunction with a spooling directory source.
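For reference, the choice between the two is just the 'deserializer' property on the spooling directory source; a minimal sketch (agent/source names are illustrative):

# Default: one Flume event per line of the source file
agent1.sources.spool-src.deserializer = LINE
agent1.sources.spool-src.deserializer.maxLineLength = 2048

# Whole-file alternative: one Flume event per source file [4]
# agent1.sources.spool-src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# agent1.sources.spool-src.deserializer.maxBlobLength = 104857600

A custom-written deserialiser is wired in the same way, by pointing the 'deserializer' property at its Builder class.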

[1] - http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source
[2] - https://github.com/cloudera/flume-ng/blob/cdh5.8.0-release/flume-ng-core/src/main/java/org/apache/fl...
[3] - https://github.com/cloudera/flume-ng/blob/cdh5.8.0-release/flume-ng-core/src/main/java/org/apache/fl...
[4] - http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#blobdeserializer
Cloudera Employee

Re: Flafka question


Flume is based on events, so a single line will be considered a message.

 

https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

 

EDIT: What Harsh Said. He's never wrong.

New Contributor

Re: Flafka question

Thanks for the great response. That makes it very clear.

 

I am assuming now that when I read the Kafka topic with the HDFS sink, files on HDFS will be created/written based on the roll interval/size that I use, and not on the original file boundaries, unless of course I use the BlobDeserializer.


Re: Flafka question

That's right.

It is also worth keeping the mindset that once something gets inside Flume from a source, in whatever size or form, it is represented as an 'event' carrying some X bytes, not as a whole 'file'. This is true of blob-based deserialisers too.

The sink's roll size setting is agnostic of what an event carries; before every event write it is only concerned with 'how many bytes have I written so far into my open file?'. So even with large blob-based events, your roll size settings will be honoured to some degree.

For example, if you configure your roll size factor as 10 MB, but all events come in at ~100 MB each, then you will naturally observe one file (of ~100 MB) per event, because the roll size factor only realises it has exceeded its limit _after_ the first write is done.
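To make that concrete, the roll controls in question sit on the HDFS sink; an illustrative snippet (names and values are placeholders):

# Roll a new HDFS file once ~10 MB has been written to the open file
agent1.sinks.hdfs-sink.hdfs.rollSize     = 10485760
# Disable event-count and time-based rolling so only size applies (0 = off)
agent1.sinks.hdfs-sink.hdfs.rollCount    = 0
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0

With ~100 MB blob events against settings like these, the size check only trips after each event has already been written, hence roughly one HDFS file per source file.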