Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Flume HDFS Sink - File Roll Settings not Working


Problem: When ingesting Avro event data from Kafka, the HDFS sink keeps rolling files while they are still very small (hundreds of bytes), despite my Flume configuration. I believe I have set the relevant properties correctly, and I'm at a bit of a loss.

 

Flume Config:

a1.channels = ch-1
a1.sources = src-1
a1.sinks = snk-1

a1.sources.src-1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.zookeeperConnect = <OMITTED>
a1.sources.src-1.topic = aTopic
a1.sources.src-1.groupID = aTopic

#Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.src-1.interceptors=i1
a1.sources.src-1.interceptors.i1.type = static
a1.sources.src-1.interceptors.i1.key=flume.avro.schema.url
a1.sources.src-1.interceptors.i1.value=hdfs://aNameService/data/schema/simpleSchema.avsc


a1.channels.ch-1.type = memory


a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

I'll also note that I tried other configuration settings that didn't help; I omitted them from this config for clarity. I also saw that some people resolved this by checking the replication factor, since that is a determining factor in the BucketWriter, but I am seeing no errors in the logs relating to under-replication.

 

Lastly, I am executing this from the command line and not through Cloudera Manager.

 

Thanks for any help

1 ACCEPTED SOLUTION


This line is missing the hdfs prefix:

a1.sinks.snk-1.rollCount = 0

 

It should be:

a1.sinks.snk-1.hdfs.rollCount = 0

Without the prefix the sink ignores the setting and falls back to the default hdfs.rollCount of 10, so every file is rolled after 10 events.
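
For reference, here is a sketch of how the roll-related part of the sink section might look with the prefix restored. It assumes the rest of the original config is unchanged; the comments are added for illustration and were not part of the original post.

```properties
# Sink section from the question, with rollCount corrected
a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
# 0 disables time-based rolling
a1.sinks.snk-1.hdfs.rollInterval = 0
# Roll once the file reaches roughly 100 MB (value is in bytes)
a1.sinks.snk-1.hdfs.rollSize = 100000000
# 0 disables event-count rolling; without the hdfs. prefix the default of 10 applies
a1.sinks.snk-1.hdfs.rollCount = 0
```

All three roll triggers are evaluated independently, so any one of them left at its default can roll a file before the others are reached.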

 

-pd



Yes, thanks for the reply! I figured out the same thing earlier today when I went back to the Flume User Guide and started copying and pasting the properties in again...

 

When I reviewed my config initially, I didn't look at what came before the attribute name, so I never noticed I was missing "hdfs". Definitely an ID10T and PEBKAC error. 🙂

 

Thanks for keeping me honest!