Created 01-10-2017 07:08 PM
Previously, I had been dealing with a data loss issue, which seems to have been fixed in a most unusual way.
I have a Kafka source feeding an HDFS sink through Flume. It has gotten into the habit of keeping two open .tmp files: it writes a chunk of events to one, stops, immediately writes the next chunk to the other, and then flips back to the first for the chunk after that. I don't see data loss while this happens, but it seems odd that Flume is handling the HDFS files this way, as I don't recall specifying that I wanted this behavior. Can anyone give me some insight into why this might be occurring? (I even set maxOpenFiles = 1 in hopes that it would fix the problem, but it didn't.)
Here is my flume configuration file:
# Flume agent config
a1.sources = kafka-source
a1.channels = channel1
a1.sinks = hdfs-sink
a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.10.x.x:xxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = channel1
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 100000
a1.channels.channel1.transactionCapacity = 1000
a1.channels.channel1.byteCapacity = 5000000000
a1.channels.channel1.byteCapacityBufferPercentage = 10
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.hdfs.batchSize = 1000
a1.sinks.hdfs-sink.channel = channel1
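For reference, with the roll settings above (rollInterval = 86400, rollSize = 0, rollCount = 0) a single agent should keep exactly one open .tmp file per output directory and roll it once a day. Below is a minimal sketch of the relevant sink settings with an idle-file timeout added; hdfs.idleTimeout is a standard HDFS sink parameter, but the 300-second value is illustrative only and is not part of the configuration used in this thread:

# Sketch only: one file per day per agent, with idle files closed automatically
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400   # roll once per day
a1.sinks.hdfs-sink.hdfs.rollSize = 0           # never roll on file size
a1.sinks.hdfs-sink.hdfs.rollCount = 0          # never roll on event count
a1.sinks.hdfs-sink.hdfs.idleTimeout = 300      # close a file after 5 minutes with no writes (illustrative value)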
Created 01-10-2017 08:10 PM
Can you provide the full names of the .tmp files (with path)?
Do you have multiple agents running?
Created 01-10-2017 08:25 PM
/topics/firewall/01-10-2017/firewall1.1484078093784.log.tmp
/topics/firewall/01-10-2017/firewall1.1484076220477.log.tmp
I only have one agent running.
As a side note, I removed the hdfs.batchSize, channel1.byteCapacity, and channel1.byteCapacityBufferPercentage parameters and started it up again. It then produced only one .tmp file: /topics/firewall/01-10-2017/firewall1.1484079147746.log.tmp
This would lead me to believe those parameters were the culprit, but not necessarily. The reason I included them in the first place is that I was experiencing data loss, as described in my other question here: https://community.hortonworks.com/questions/76473/data-loss-missing-using-flume-with-kafka-source-an...
I expect that, now that I've returned to the configuration that experienced data loss, I will see those missing events again (yet, for the moment, it is only writing to one file again).
Does Flume split the events into multiple files for some reason because of the batchSize or byteCapacity parameters?
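For reference, those parameters govern batching and channel sizing rather than file rolling, so by themselves they shouldn't split a single agent's output into two .tmp files. A commented sketch of what each one controls (values copied from the config above; the comments paraphrase the Flume documentation, they are not from the original thread):

# What the removed parameters actually control
a1.channels.channel1.byteCapacity = 5000000000           # max total bytes of events the memory channel may hold
a1.channels.channel1.byteCapacityBufferPercentage = 10   # headroom above the event bodies reserved for event headers
a1.sinks.hdfs-sink.hdfs.batchSize = 1000                 # events written to the open file before each flush to HDFS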
Created 01-10-2017 08:27 PM
Now it seems another .tmp file has appeared after a refresh. As a side note, I do a rolling restart of the Flume agents in our cluster each time, but I believe the agent grabbing this data runs on only one server. Also, these double .tmp files didn't exist a week ago; it was putting only one .tmp file in each folder, as we wanted (sadly, with the data loss, though...).
Created 01-10-2017 08:32 PM
To clarify how the two files are divided, I'll use file1 and file2 to illustrate:
file1 begins with an event at 20:10:53 and continues without skipping events until 20:23:20
file2 begins with an event at 20:23:20 and continues until 20:26:53
If it follows the same pattern as in the past, file2 will stop at some point, say 20:30:00, and then file1 will start having events appended to it where file2 left off, and the two keep alternating back and forth.
Created 01-17-2017 04:28 PM
The issue is resolved. It had to do with our cluster setup: Flume agents were running on multiple machines. Once I turned off the agents on all but the one machine referenced in the config, we experienced no data loss and saw only one .tmp file.
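For anyone hitting the same symptom, one way to make a duplicate agent immediately visible is to tag each agent's output with its hostname via the standard Flume host interceptor. This is only a sketch, not the configuration used above; the interceptor name "hostint" is arbitrary, and using %{host} in filePrefix assumes the same header escaping that hdfs.path supports:

# Sketch only: tag each agent's files with its hostname so duplicate writers are obvious
a1.sources.kafka-source.interceptors = hostint
a1.sources.kafka-source.interceptors.hostint.type = host     # adds a "host" header to every event
a1.sources.kafka-source.interceptors.hostint.useIP = false   # record the hostname rather than the IP address
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1.%{host}       # each agent writes distinctly named files

With a prefix like this, two agents writing into the same daily directory would produce files carrying two different hostnames, making the extra agent easy to spot.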