
Flume HDFS Sink - File Roll Settings not Working


Rising Star

Problem: When ingesting Avro event data from Kafka, the HDFS sink keeps rolling files while they are still very small (hundreds of bytes), despite my Flume configuration. I believe I have made the proper configuration settings, so I'm at a bit of a loss.

 

Flume Config:

a1.channels = ch-1
a1.sources = src-1
a1.sinks = snk-1

a1.sources.src-1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.zookeeperConnect = <OMITTED>
a1.sources.src-1.topic = aTopic
a1.sources.src-1.groupID = aTopic

#Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.src-1.interceptors=i1
a1.sources.src-1.interceptors.i1.type = static
a1.sources.src-1.interceptors.i1.key=flume.avro.schema.url
a1.sources.src-1.interceptors.i1.value=hdfs://aNameService/data/schema/simpleSchema.avsc


a1.channels.ch-1.type = memory


a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

I'll also note that I tried adding other configuration settings that didn't help; I've omitted them from this config for clarity. I also saw that for some people the resolution was to check the replication factor, since that is a determining factor in the BucketWriter, but I am seeing no errors in the logs related to under-replication.
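For reference, the replication-related fix others mention is the hdfs.minBlockReplicas setting on the sink; shown here only for context, since it did not apply in my case:

a1.sinks.snk-1.hdfs.minBlockReplicas = 1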

 

Lastly, I am executing this from the command line and not through Cloudera Manager.

 

Thanks for any help

1 ACCEPTED SOLUTION

Re: Flume HDFS Sink - File Roll Settings not Working

Super Collaborator

This line is missing the hdfs prefix:

a1.sinks.snk-1.rollCount = 0

 

It should be:

a1.sinks.snk-1.hdfs.rollCount = 0

Otherwise all your files will contain 10 events, which is the default hdfs.rollCount.
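With the prefix in place, the roll-related settings from your config would read:

a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.hdfs.rollCount = 0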

 

-pd


Re: Flume HDFS Sink - File Roll Settings not Working

Rising Star

Yes, thanks for the reply! I figured out the same thing earlier today when I went back to the Flume User Guide and started copying and pasting the properties in again...

 

When I reviewed my config initially, I didn't look at the start of the attribute name, so I never noticed I was missing "hdfs". Definitely an ID10T and PEBKAC error. :)

 

Thanks for keeping me honest!