07-27-2018 04:19 PM - last edited on 07-30-2018 06:45 AM by cjervis
I'm setting up Flume to read data from Kafka and write it to HDFS, using the HDFS sink.
This is the sink's conf:
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/flume/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 3600
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.round = true
tier1.sinks.sink1.hdfs.roundvalue = 60
tier1.sinks.sink1.hdfs.roundUnit = minutes
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.filePrefix = %H
The relevant Kafka topic has 4 partitions.
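To make the path settings concrete, here is a minimal Python sketch of how the date escapes in hdfs.path and hdfs.filePrefix would resolve for a given timestamp. It assumes the %y, %m, %d and %H escapes behave like their strftime counterparts, and it ignores the rounding settings. The function name resolve_bucket is hypothetical, not part of Flume.

```python
from datetime import datetime


def resolve_bucket(ts, path="/tmp/flume/%y-%m-%d", prefix="%H"):
    """Expand the date escapes from the sink config for one timestamp.

    Assumption: the escapes map one-to-one onto strftime directives.
    Rounding (hdfs.round and related settings) is not modelled here.
    """
    return ts.strftime(path), ts.strftime(prefix)


# Example: an event stamped with the agent's local time (useLocalTimeStamp = true).
directory, file_prefix = resolve_bucket(datetime(2018, 7, 27, 16, 19))
print(directory, file_prefix)  # /tmp/flume/18-07-27 16
```

Under these assumptions, events from the same hour share one directory and one file prefix, so the hourly prefix lines up with the one-hour rollInterval.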
I expected this to do one of the following:
1. Create 4 .tmp files and keep writing to them until the defined rolling policy is applied.
2. Create 4 .tmp files, buffer the data somewhere until the defined rolling policy is applied, and then write the data to the files.
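The rolling policy referred to in both options can be sketched roughly as follows. This is a simplified model of the three roll settings, not Flume's actual code. It assumes that a value of 0 disables a trigger, as the Flume documentation describes for rollInterval, rollSize and rollCount. The function name should_roll is hypothetical.

```python
def should_roll(age_seconds, bytes_written, events_written,
                roll_interval=3600, roll_size=0, roll_count=0):
    """Return True when any enabled roll trigger has fired.

    Defaults mirror the sink config in this post: only the time-based
    trigger is active, because rollSize and rollCount are 0 (disabled).
    """
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and bytes_written >= roll_size
    by_count = roll_count > 0 and events_written >= roll_count
    return by_time or by_size or by_count


# A large file that is 59 minutes old is not rolled yet.
print(should_roll(3540, 10**9, 10**6))  # False
# Once a full hour has elapsed, the file rolls regardless of its size.
print(should_roll(3600, 0, 0))          # True
```

With this configuration, each .tmp file should stay open for the full hour no matter how much data arrives.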
I can't explain the behaviour that I see.
4 .tmp files are created. A very small amount of data is written to them (around 100k per file), and that is it for the whole hour.
When the hour has passed and the rolling policy is applied, all the data is written to them and the files are renamed without the .tmp suffix.
Can someone please explain this behaviour?
07-29-2018 05:51 PM