Created on 08-19-2016 09:53 PM - edited 09-16-2022 03:35 AM
Problem: When ingesting Avro event data from Kafka, the HDFS sink keeps rolling files while they are still very small (hundreds of bytes), despite my Flume configuration. I believe I have the right settings in place, and I'm at a bit of a loss.
Flume Config:
a1.channels = ch-1
a1.sources = src-1
a1.sinks = snk-1

a1.sources.src-1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.zookeeperConnect = <OMITTED>
a1.sources.src-1.topic = aTopic
a1.sources.src-1.groupID = aTopic

#Inject the Schema into the header so the AvroEventSerializer can pick it up
a1.sources.src-1.interceptors=i1
a1.sources.src-1.interceptors.i1.type = static
a1.sources.src-1.interceptors.i1.key=flume.avro.schema.url
a1.sources.src-1.interceptors.i1.value=hdfs://aNameService/data/schema/simpleSchema.avsc

a1.channels.ch-1.type = memory

a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.channel = ch-1
a1.sinks.snk-1.hdfs.path = /data/table
a1.sinks.snk-1.hdfs.filePrefix = events
a1.sinks.snk-1.hdfs.fileSuffix = .avro
a1.sinks.snk-1.hdfs.rollInterval = 0
#Expecting 100MB files before rolling
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
a1.sinks.snk-1.hdfs.batchSize = 1000
a1.sinks.snk-1.hdfs.fileType = DataStream
a1.sinks.snk-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
I'll also note that I tried adding other configuration settings that didn't help, and I have omitted them from this config for clarity. I also saw that for some people the resolution was to check the replication factor, since that is a determining factor in the BucketWriter. I am not seeing any errors in the logs relating to under-replication.
Lastly, I am executing this from the command line and not through Cloudera Manager.
Thanks for any help
Created 08-22-2016 09:28 AM
This line is missing the hdfs prefix:
a1.sinks.snk-1.rollCount = 0
It should be:
a1.sinks.snk-1.hdfs.rollCount = 0
Otherwise all your files will contain 10 events, which is the default hdfs.rollCount.
-pd
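A missing prefix like this is easy to overlook by eye, so a quick check of the properties file can help. The following is only a minimal sketch, not part of Flume: the function name `find_missing_hdfs_prefix` is hypothetical, the list of property names is taken from the config in this thread rather than the full User Guide, and it assumes every sink in the file is an HDFS sink.

```python
import re

# HDFS sink settings that appear with the "hdfs." prefix in the config above.
# Assumption: this list is illustrative, not the complete set from the User Guide.
HDFS_PREFIXED = {"rollInterval", "rollSize", "rollCount", "batchSize",
                 "path", "filePrefix", "fileSuffix", "fileType"}

# Matches keys of the form <agent>.sinks.<sink>.<prop> with no further nesting.
SINK_KEY = re.compile(
    r"^(?P<agent>[^.\s]+)\.sinks\.(?P<sink>[^.\s]+)\.(?P<prop>[^.\s=]+)$"
)

def find_missing_hdfs_prefix(config_text):
    """Return (line_number, key, suggested_key) for sink keys lacking 'hdfs.'."""
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        match = SINK_KEY.match(key)
        if match and match.group("prop") in HDFS_PREFIXED:
            suggested = "{agent}.sinks.{sink}.hdfs.{prop}".format(**match.groupdict())
            problems.append((number, key, suggested))
    return problems

sample = """\
a1.sinks.snk-1.type = hdfs
a1.sinks.snk-1.hdfs.rollInterval = 0
a1.sinks.snk-1.hdfs.rollSize = 100000000
a1.sinks.snk-1.rollCount = 0
"""

for number, key, suggested in find_missing_hdfs_prefix(sample):
    print(f"line {number}: {key} -> {suggested}")
```

Run against the four sink lines above, this flags only the rollCount line and suggests the prefixed key. It does not look at the sink type, so on a config that mixes sink types it could flag keys that are legitimate for a non-HDFS sink.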
Created 08-22-2016 05:43 PM
Yes, thanks for the reply! I figured out the same thing earlier today when I went back to the Flume User Guide and started copying and pasting the properties in again...
When I reviewed my config initially, I didn't look in front of the attribute name, so I never noticed that "hdfs" was missing. Definitely an ID10T and PEBKAC error. 🙂
Thanks for keeping me honest!