
Data loss (missing) using Flume with Kafka source and HDFS sink

Expert Contributor

I am experiencing data loss (chunks of time missing from the data) when pulling data off a Kafka topic as a source and writing it into an HDFS file (DataStream) as a sink. The missing data seems to come in 10, 20 or 30 minute blocks. I have verified that the skipped data is present in the topic's .log file generated by Kafka. (The original data comes from syslog, goes through a different Flume agent, and is put into the Kafka topic; the data loss isn't happening there.) I find it interesting and unusual that the blocks of skipped data are always 10, 20 or 30 minutes long and happen at least once an hour.

Here is a copy of my configuration file:

a1.sources = kafka-source
a1.channels = memory-channel
a1.sinks = hdfs-sink

a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.xx.x.xx:xxxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = memory-channel

a1.channels.memory-channel.type = memory
a1.channels.memory-channel.capacity = 100000
a1.channels.memory-channel.transactionCapacity = 1000

a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.channel = memory-channel

I also cannot find any indication of data loss in the Kafka logs or the Flume logs.


7 REPLIES

Expert Contributor

As an update to this: I've discovered that when I create a second sink and have it write to a local file, I still experience data loss during the same time windows, but on a smaller scale. The file sink loses data in 10-30 second blocks, while the data written to HDFS is missing 10-30 minute blocks.
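One thing worth checking, offered as a guess: in Flume, two sinks attached to the same channel compete for events (each event is delivered to only one of them), so attaching a file sink to the existing memory-channel would split the stream rather than copy it. To compare the two sinks fairly, the source can replicate into two channels. A rough sketch, showing only lines that differ from the posted config; the file-sink name, extra channel name, and directory are illustrative, not from the original setup:

a1.channels = memory-channel file-channel-copy
a1.sinks = hdfs-sink file-sink

# The replicating selector (Flume's default) delivers every event to both channels
a1.sources.kafka-source.selector.type = replicating
a1.sources.kafka-source.channels = memory-channel file-channel-copy

a1.channels.file-channel-copy.type = memory
a1.channels.file-channel-copy.capacity = 100000
a1.channels.file-channel-copy.transactionCapacity = 1000

# Hypothetical local-file sink for comparison, using the file_roll sink type
a1.sinks.file-sink.type = file_roll
a1.sinks.file-sink.sink.directory = /var/log/flume/firewall-copy
a1.sinks.file-sink.channel = file-channel-copy

a1.sinks.hdfs-sink.channel = memory-channel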

Explorer (accepted solution)

This may be related to the other problem you have posted (dual temp files).

If you are using multiple sinks or agents, make sure each one is writing to a different file/directory; otherwise they will overwrite each other and it will look like data loss.

With only one agent running, there should be only one .tmp file receiving writes at any given time. After rollInterval, that .tmp file should be closed, lose its .tmp suffix, and new data should go into a new .tmp file. If you are seeing many open .tmp files, that can indicate intermittent network or other issues preventing Flume from writing to and closing the .tmp files in HDFS properly, so it opens a new file without properly closing the old one.

Another potential cause of data loss is restarting the Flume agent, or agent crashes: the memory channel loses whatever it is buffering in those cases.

Suggestion: roll hourly if possible (see the sketch below).
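A minimal sketch of what hourly rolling could look like on the posted config, plus an optional durable file channel so buffered events survive an agent restart or crash. The checkpoint and data directories are example paths, not from the original setup, and the channel name is kept from the original config only for brevity:

# Roll a new HDFS file every hour and bucket the path by hour as well
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y/%H
a1.sinks.hdfs-sink.hdfs.rollInterval = 3600
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0

# Optional: swap the memory channel for a file channel so a restart or crash
# does not drop the events currently buffered in the channel
a1.channels.memory-channel.type = file
a1.channels.memory-channel.checkpointDir = /var/lib/flume/checkpoint
a1.channels.memory-channel.dataDirs = /var/lib/flume/data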

Expert Contributor

I do notice that it sometimes crashes when I restart Flume, but the data loss occurs at times when I am not restarting it, so I doubt that is the culprit.

You've given me a lot of different avenues to explore as possible causes. We are running a cluster with two masters and 5 slaves. When I do a rolling restart it restarts all 7 machines, so Flume is running on all 7. But the config file only lists the one Kafka source we pull data from (see the config above). It's true that we have data being sent from syslog to Kafka on both master1 and master2, but a Flume agent is only retrieving data from master1's Kafka source. So there is only one Flume agent active, I believe.

Explorer

If you have syslog as the source, then there is a possibility that the memory channel fills up at times and cannot accept incoming syslog messages. Since syslog does not retry sending to Flume, that data would simply be dropped.
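For reference, a rough sketch of how the syslog-to-Kafka agent's channel could be sized up to absorb bursts. Everything here is assumed, since that agent's configuration was never posted; the agent name, port, and capacities are illustrative only:

# Hypothetical syslog -> Kafka agent; names, port, and sizes are illustrative
a2.sources = syslog-source
a2.channels = syslog-channel
a2.sinks = kafka-sink

a2.sources.syslog-source.type = syslogtcp
a2.sources.syslog-source.host = 0.0.0.0
a2.sources.syslog-source.port = 5140
a2.sources.syslog-source.channels = syslog-channel

# A larger channel makes it less likely that a burst fills it and forces
# incoming syslog messages (which are never retried) to be dropped
a2.channels.syslog-channel.type = memory
a2.channels.syslog-channel.capacity = 1000000
a2.channels.syslog-channel.transactionCapacity = 1000

a2.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a2.sinks.kafka-sink.topic = firewall
a2.sinks.kafka-sink.brokerList = 10.xx.x.xx:xxxx
a2.sinks.kafka-sink.channel = syslog-channel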

Expert Contributor

Thanks, but while syslog is our original source, it is not the source for the HDFS sink.

We have: syslog source -> Kafka sink

Then: Kafka source -> HDFS sink

The data loss isn't occurring there, but as I mentioned in my other question, I'm still getting two .tmp files. I'm trying to determine whether I could have more than one Flume agent running, but I only see one. I thought that since we have two masters in our cluster, I would stop one of them and let the other run, to see whether a Flume agent on master1 was creating one .tmp file and a Flume agent on master2 was creating the other, giving me these unusual results.

For instance, it seemed to point that way when I stopped m1 and only one .tmp file resolved itself; when I stopped m2, the other resolved itself. Oddly enough, when I started m1 again, two .tmp files appeared, and when I stopped it, only one resolved itself! Then I started m2 again and a new .tmp file appeared! I'm completely baffled. I don't see how m2 could be writing into HDFS, since the configuration file never mentions the IP address of m2, only m1's...

I am starting to think there is something about the cluster setup causing this that I don't understand.
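If the same configuration file was deployed to every node, each node would start an identical agent writing to the same HDFS path and prefix, which would look exactly like this. One way to make any such collision obvious, sketched here and not part of the original config, is to stamp each event with the hostname of the agent that read it and include that in the file prefix:

# Host interceptor adds the agent's hostname to each event's headers
a1.sources.kafka-source.interceptors = host-int
a1.sources.kafka-source.interceptors.host-int.type = host
a1.sources.kafka-source.interceptors.host-int.useIP = false
a1.sources.kafka-source.interceptors.host-int.hostHeader = host

# Putting the hostname into the prefix means agents on different machines
# can no longer write to, or roll, the same HDFS file
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1-%{host}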

Expert Contributor

Wow, more unusual behavior now... stopping one of the slaves on our cluster caused one of the .tmp files to resolve itself, and then immediately another .tmp file appeared. I'm going to try stopping all the Flume agents on m1, m2, and the 5 slaves and only start the one on m1.

Expert Contributor

The issue is resolved. It has to do with our cluster setup: once I turned off the Flume agents on all but the one machine referenced in the config, we experienced no data loss and only one .tmp file.