Created 01-05-2017 02:39 PM
I am experiencing data loss (chunks of time skipped in the data) when pulling data off a Kafka topic as a source and putting it into an HDFS file (DataStream) as a sink. The skipped data always seems to come in 10-, 20-, or 30-minute blocks. I have verified that the skipped data is present in the topic .log file generated by Kafka. (The original data comes from a syslog, goes through a different Flume agent, and is put into the Kafka topic; the data loss isn't happening there.) I find it interesting and unusual that the blocks of skipped data are always 10, 20, or 30 minutes long and happen at least once an hour in my data.
Here is a copy of my configuration file:
a1.sources = kafka-source
a1.channels = memory-channel
a1.sinks = hdfs-sink
a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.xx.x.xx:xxxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = memory-channel
a1.channels.memory-channel.type = memory
a1.channels.memory-channel.capacity = 100000
a1.channels.memory-channel.transactionCapacity = 1000
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.channel = memory-channel
I also cannot find any indication of data loss in the Kafka or Flume logs.
Created 01-05-2017 06:15 PM
As an update to this: I've discovered that when I create a second sink and have it write to a local file, I still experience data loss during the same time windows, but on a smaller scale. The file sink skips 10-30 seconds at a time, while the data written to HDFS skips 10-30 minutes at a time.
Created 01-10-2017 08:35 PM
This may be related to the other problem you have posted (dual temp files).
If you are using multiple sinks or agents, make sure each one is writing to a different file/directory; otherwise they will overwrite each other and it will look like data loss.
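For example, if you ever did need a second agent on another host (call it a2, a hypothetical name), giving its sink its own filePrefix (or its own path) keeps the two from colliding with each other:
a2.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a2.sinks.hdfs-sink.hdfs.filePrefix = firewall2
a2.sinks.hdfs-sink.hdfs.fileSuffix = .log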
With only one agent running, there should be only one .tmp file being written to at a time in your case. After rollInterval, that .tmp file should be closed and lose its .tmp suffix, and new data should go into a new .tmp file. If you are seeing many open .tmp files, that could indicate intermittent network or other issues causing Flume to fail to write to and close the .tmp files in HDFS properly, so it opens a new file without properly closing the old one.
Another potential cause of data loss is restarting the Flume agent, or any crashes you may be seeing; the memory channel will lose its data in those cases.
Suggestion: roll hourly instead of daily, if possible.
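A rough sketch of that change, along with swapping the memory channel for a file channel (my suggestion for surviving restarts, not something you must do; the checkpoint/data directories below are placeholder paths you would adjust):
a1.sinks.hdfs-sink.hdfs.rollInterval = 3600
a1.channels = file-channel
a1.channels.file-channel.type = file
a1.channels.file-channel.checkpointDir = /var/lib/flume/checkpoint
a1.channels.file-channel.dataDirs = /var/lib/flume/data
a1.sources.kafka-source.channels = file-channel
a1.sinks.hdfs-sink.channel = file-channel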
Created 01-10-2017 08:44 PM
I do notice that it sometimes crashes when I restart Flume, but the data loss occurs at times when I am not restarting it, so I doubt that is the culprit.
You've given me a lot of different avenues to explore as possible causes. We are running a cluster with two masters and five slaves. When I do a rolling restart it restarts all seven machines, so Flume is running on all seven. But the config file only lists the one Kafka source we are pulling data from (see the config file above). It's true that we have data being sent from syslog to Kafka on both master1 and master2, but a Flume agent is only retrieving data from master1's Kafka source. So there is only one active Flume agent, I believe.
Created 01-16-2017 07:26 PM
If you have syslog as the source, then there is a possibility that the memory channel is sometimes full and cannot accept incoming syslog messages. Since syslog does not retry sending to Flume, that data could be getting dropped.
Created 01-16-2017 07:33 PM
Thanks, but while syslog is our original source, it's actually not the source for the HDFS sink.
We have syslog source -> kafka sink
Then kafka source -> hdfs sink
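For reference, the first agent is configured roughly along these lines (just a sketch; the agent name, port, and broker address below are placeholders rather than our exact values):
a0.sources = syslog-source
a0.channels = memory-channel
a0.sinks = kafka-sink
a0.sources.syslog-source.type = syslogudp
a0.sources.syslog-source.host = 0.0.0.0
a0.sources.syslog-source.port = 5140
a0.sources.syslog-source.channels = memory-channel
a0.channels.memory-channel.type = memory
a0.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a0.sinks.kafka-sink.topic = firewall
a0.sinks.kafka-sink.brokerList = 10.xx.x.xx:9092
a0.sinks.kafka-sink.channel = memory-channel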
The data loss isn't occurring, but as I mentioned in my other question, I'm still getting two .tmp files. I'm trying to determine whether I could have more than one Flume agent running, but I only see one. I thought that since we have two masters in our cluster, I would stop one and let the other run, to see whether a Flume agent on master1 was creating one .tmp file and a Flume agent on master2 was creating the other, giving me these unusual results.
For instance, it seemed to point to this being the issue: when I stopped m1, only one .tmp file resolved itself, and when I stopped m2, the other resolved itself. Oddly enough, though, when I started m1 again, two .tmp files appeared, and when I stopped it, only one resolved itself! Then I started m2 again and a new .tmp file appeared! I'm completely baffled. I don't see how m2 could be writing into HDFS, since the configuration file never mentions the IP address of m2, only m1...
I am starting to think there's something about the clustering setup causing this that I don't understand.
Created 01-16-2017 07:36 PM
Wow, more unusual behavior now... stopping one of the slaves in our cluster caused one of the .tmp files to resolve itself, and then immediately another .tmp file appeared. I'm going to try stopping all the Flume agents on m1, m2, and the five slaves, and only start the one on m1.
Created 01-17-2017 04:28 PM
The issue is resolved. It had to do with our cluster setup. Once I turned off the Flume agents on every machine except the one referenced in the config, we experienced no data loss and only one .tmp file.