Contributor
Accepted Solution

Flume HDFS Sink - File Roll Settings not Working

Problem: When ingesting Avro event data from Kafka, the HDFS sink keeps rolling files while they are still very small (hundreds of bytes), despite the roll settings in my Flume configuration. I believe the settings are correct, and I'm at a bit of a loss.

 

Flume Config:

a1.channels = ch-1
a1.sources = src-1
a1.sinks = snk-1

a1.sources.src-1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.zookeeperConnect = <OMITTED>
a1.sources.src-1.topic = aTopic
a1.sources.src-1.groupID = aTopic

#Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.src-1.interceptors=i1
a1.sources.src-1.interceptors.i1.type = static
a1.sources.src-1.interceptors.i1.key=flume.avro.schema.url
a1.sources.src-1.interceptors.i1.value=hdfs://aNameService/data/schema/simpleSchema.avsc


a1.channels.ch-1.type = memory


a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

I also tried other configuration settings that didn't help; I have omitted them from this config for clarity. I saw that the resolution for some people was to check the replication factor, since it is a determining factor in the BucketWriter's decision to roll. I am not seeing any errors in the logs relating to under-replication.

 

Lastly, I am executing this from the command line and not through Cloudera Manager.

 

Thanks for any help

Cloudera Employee

Re: Flume HDFS Sink - File Roll Settings not Working

This line is missing the hdfs prefix:

a1.sinks.snk-1.rollCount = 0

 

It should be:

a1.sinks.snk-1.hdfs.rollCount = 0

Otherwise the sink falls back to the default hdfs.rollCount of 10, so every file will be rolled after only 10 events.
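
For reference, a minimal sketch of the corrected sink section, assuming every other value from the original post stays the same. With hdfs.rollInterval and hdfs.rollCount both set to 0, time-based and count-based rolling are disabled, which should leave hdfs.rollSize (in bytes) as the only roll trigger:

```properties
a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
# 0 = never roll based on elapsed time
a1.sinks.snk-1.hdfs.rollInterval = 0
# Roll once the file reaches roughly 100 MB (value is in bytes)
a1.sinks.snk-1.hdfs.rollSize = 100000000
# 0 = never roll based on event count (note the hdfs. prefix)
a1.sinks.snk-1.hdfs.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
```

On the HDFS sink, only type and channel are set directly on the sink name; the path, prefix, roll and batch settings all take the hdfs. prefix, so a quick scan of the sink block for roll or batch keys without that prefix can catch this kind of slip.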

 

-pd

Contributor

Re: Flume HDFS Sink - File Roll Settings not Working

Yes, thanks for the reply! I figured out the same thing earlier today when I went back to the Flume User Guide and started copying and pasting the properties in again...

 

When I reviewed my config initially, I didn't look in front of the attribute name, so I never noticed I was missing "hdfs". Definitely an ID10T and PEBKAC error. :)

 

Thanks for keeping me honest!