
Flume HDFS Sink - File Roll Settings not Working


Rising Star

Problem: When ingesting Avro event data from Kafka, the HDFS sink keeps rolling files while they are still very small (hundreds of bytes), despite my Flume configuration. I believe I have made the proper configuration settings, so I'm at a bit of a loss.

 

Flume Config:

a1.channels = ch-1
a1.sources = src-1
a1.sinks = snk-1

a1.sources.src-1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.zookeeperConnect = <OMITTED>
a1.sources.src-1.topic = aTopic
a1.sources.src-1.groupID = aTopic

#Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.src-1.interceptors=i1
a1.sources.src-1.interceptors.i1.type = static
a1.sources.src-1.interceptors.i1.key=flume.avro.schema.url
a1.sources.src-1.interceptors.i1.value=hdfs://aNameService/data/schema/simpleSchema.avsc


a1.channels.ch-1.type = memory


a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

I'll also note that I tried adding other configuration settings that didn't help; I've omitted them from this config for clarity. I also saw that for some people the resolution was to check the replication factor, since that is a determining factor in the BucketWriter, but I am seeing no errors in the logs related to under-replication.
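For reference, the replication-related fix others mention is the hdfs.minBlockReplicas setting on the sink; shown here only for context, since it did not apply in my case:

a1.sinks.snk-1.hdfs.minBlockReplicas = 1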

 

Lastly, I am executing this from the command line and not through Cloudera Manager.

 

Thanks for any help

1 ACCEPTED SOLUTION

Re: Flume HDFS Sink - File Roll Settings not Working

Super Collaborator

This line is missing the hdfs prefix:

a1.sinks.snk-1.rollCount = 0

 

It should be:

a1.sinks.snk-1.hdfs.rollCount = 0

Otherwise all your files will contain 10 events, which is the default hdfs.rollCount.
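With the prefix in place, the roll-related settings from your config would read:

a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.hdfs.rollCount = 0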

 

-pd


Re: Flume HDFS Sink - File Roll Settings not Working

Rising Star

Yes, thanks for the reply! I figured out the same thing earlier today when I went back to the Flume User Guide and started copying and pasting the properties in again...

 

When I reviewed my config initially, I didn't look at the start of the attribute name, so I never noticed I was missing "hdfs". Definitely an ID10T and PEBKAC error. :)

 

Thanks for keeping me honest!