
Kafka source to HDFS sink through Flume putting data in chunks in two different files

Expert Contributor

Previously, I had been dealing with a data loss issue, which now seems to have been fixed in the most unusual way.

I have a Kafka source going to an HDFS sink through Flume. It is now in the habit of creating two open .tmp files: it puts a chunk of events in one, then stops and immediately puts the next chunk in the other, and then flips back to the first for the chunk after that. I don't see data loss while this happens, but it seems odd that Flume is splitting the HDFS files this way, since I don't recall configuring it to do so. Can anyone give me some insight into why this might be occurring? (I even set maxOpenFiles = 1 in the hope it would fix the problem, but it didn't.)

Here is my flume configuration file:

# Flume agent config
a1.sources = kafka-source
a1.channels = channel1
a1.sinks = hdfs-sink

a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.10.x.x:xxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = channel1

a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 100000
a1.channels.channel1.transactionCapacity = 1000
a1.channels.channel1.byteCapacity = 5000000000
a1.channels.channel1.byteCapacityBufferPercentage = 10

a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.hdfs.batchSize = 1000
a1.sinks.hdfs-sink.channel = channel1
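For anyone wanting to check the same thing, here is roughly how I've been counting the open .tmp files in the day's directory. It's only a sketch: it assumes the hdfs CLI is on the PATH and uses the date layout from hdfs.path above.

import subprocess
from datetime import date

# Directory produced by hdfs.path = /topics/%{topic}/%m-%d-%Y for today's date
day_dir = "/topics/firewall/" + date.today().strftime("%m-%d-%Y")

# List the directory and keep only files still being written (.tmp suffix)
listing = subprocess.run(["hdfs", "dfs", "-ls", day_dir],
                         capture_output=True, text=True, check=True)
open_files = [line for line in listing.stdout.splitlines() if line.endswith(".tmp")]

print(f"{len(open_files)} open .tmp file(s) in {day_dir}:")
for line in open_files:
    print(line)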


5 REPLIES

Explorer (Accepted solution)

Can you provide the full names of the tmp files (with path)?

Do you have multiple agents running?

Expert Contributor

/topics/firewall/01-10-2017/firewall1.1484078093784.log.tmp

/topics/firewall/01-10-2017/firewall1.1484076220477.log.tmp

I only have one agent running.

As a side note, I removed the hdfs.batchSize, channel1.byteCapacity, and channel1.byteCapacityBufferPercentage parameters and started the agent up again. It then produced only one .tmp file: /topics/firewall/01-10-2017/firewall1.1484079147746.log.tmp

That would lead me to believe those parameters were the culprit, but not necessarily. The reason I included them in the first place is that I was experiencing data loss, as described in my other question here: https://community.hortonworks.com/questions/76473/data-loss-missing-using-flume-with-kafka-source-an...

I expect that now that I've returned to the configuration that experienced the data loss, I will see those missing events again (yet, for now, it is only writing to one file again).

Does Flume split events across multiple files for some reason because of the batchSize or byteCapacity parameters?
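As an aside, if the numeric part of those file names is the epoch-millisecond counter the HDFS sink stamps on a file when it opens it (I believe that is the default behaviour, though I'm not certain), then the two files were opened roughly half an hour apart. A quick sketch to decode them:

from datetime import datetime, timezone

# Numeric suffixes taken from the two .tmp file names above
counters_ms = [1484076220477, 1484078093784]

for ms in counters_ms:
    # Interpret each suffix as a millisecond epoch timestamp
    opened = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    print(ms, "->", opened.isoformat())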

Expert Contributor

Now it seems another .tmp file has appeared after I did a refresh. As a side note, I do a rolling restart on the Flume agents in our cluster each time, but I believe the Flume agent that grabs this data runs on a single server. Also, these double .tmp files didn't exist a week ago; it was only putting one .tmp file in each folder, as we wanted (sadly, with the data loss though...).

Expert Contributor

For clarity on how the two files are divided, I'll use the names file1 and file2 to illustrate:

file1 begins with an event at 20:10:53 and continues without skipping events until 20:23:20

file2 begins with an event at 20:23:20 and continues until 20:26:53

If it follows the same pattern as in the past, file2 will stop at some point, say 20:30:00, then file1 will start having events appended to it where file2 left off, and it keeps flipping back and forth like that.
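In case it helps anyone compare, here is roughly how the first and last events in each .tmp file can be pulled out. This is only a sketch: it assumes the hdfs CLI is on the PATH and that each event line starts with its timestamp, and the last line from -tail may be cut off since -tail starts mid-record.

import subprocess

files = [
    "/topics/firewall/01-10-2017/firewall1.1484076220477.log.tmp",
    "/topics/firewall/01-10-2017/firewall1.1484078093784.log.tmp",
]

for path in files:
    # First event: stream the file and stop after the first line
    cat = subprocess.Popen(["hdfs", "dfs", "-cat", path],
                           stdout=subprocess.PIPE, text=True)
    first = cat.stdout.readline().rstrip()
    cat.terminate()

    # Last event: hdfs dfs -tail prints the final kilobyte of the file
    tail = subprocess.run(["hdfs", "dfs", "-tail", path],
                          capture_output=True, text=True, check=True)
    lines = tail.stdout.splitlines()
    last = lines[-1] if lines else ""

    print(path)
    print("  first event:", first)
    print("  last event :", last)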

Expert Contributor

The issue is resolved. It has to do with our cluster setup: once I turned off the Flume agents on all but the one machine referenced in the config, we experienced no data loss and only one .tmp file.
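For anyone who hits the same thing, a rough way to check whether more than one host is running a Flume agent. It's only a sketch: it assumes passwordless SSH to each node, that pgrep -a is available there, and that the agent runs under the usual org.apache.flume.node.Application main class; the host names are placeholders.

import subprocess

# Hypothetical host list - replace with the nodes in your cluster
hosts = ["node1.example.com", "node2.example.com", "node3.example.com"]

for host in hosts:
    # pgrep -af prints the PID and full command line of any matching process
    result = subprocess.run(
        ["ssh", host, "pgrep", "-af", "org.apache.flume.node.Application"],
        capture_output=True, text=True)
    if result.stdout.strip():
        print(f"{host}: Flume agent(s) running")
        print(result.stdout.strip())
    else:
        print(f"{host}: no Flume agent found")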