
HOW TO SCALE FLUME AGENT

New Contributor

Hi everyone,

I have a question about scaling Flume. The context is that we receive a lot of CSV files on an edge node and plan to move them to HDFS using Flume. My first question is how to scale (parallelize?) the Flume agent. My second question is how Flume recovers from failures, such as a lost connection: is there a commit-offset-like approach in the Flume agent?

Thanks in advance.


Re: HOW TO SCALE FLUME AGENT

Expert Contributor

To your second question: Flume itself is very reliable and uses a transactional approach. As per the Flume docs:

The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

However, the reliability depends largely on the channel that you are using:

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There’s also a memory channel which simply stores the events in an in-memory queue, which is faster but any events still left in the memory channel when an agent process dies can’t be recovered.

So if you want absolute reliability, use the durable file channel and accept the tradeoff in speed (the memory channel is faster, but its events are not recoverable).
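
To make that tradeoff concrete, here is a minimal sketch of an agent using the durable file channel. The agent name, directories, and HDFS path are placeholders I made up for the example, not anything from your setup:

# Hypothetical agent "a1"; all names and paths are examples only
a1.sources = src1
a1.channels = ch1
a1.sinks = k1

# Spooling Directory source watching the edge-node landing directory
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/incoming/csv
a1.sources.src1.channels = ch1

# Durable file channel: events are checkpointed and logged on local
# disk, so they survive an agent restart or crash
a1.channels.ch1.type = file
a1.channels.ch1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.ch1.dataDirs = /var/lib/flume/data

# HDFS sink: events are removed from the channel only after the
# transaction that writes them to HDFS commits
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /landing/csv
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = ch1

Swapping the channel type from file to memory gives higher throughput, at the cost of losing any in-flight events if the agent dies.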

There's a KafkaChannel too if you want an offset/topic mechanism: https://flume.apache.org/FlumeUserGuide.html#kafka-channel
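
A minimal KafkaChannel sketch under the same hypothetical agent name (the brokers, topic, and group id are placeholders):

# KafkaChannel: events are staged in a Kafka topic, so delivery state
# is tracked through Kafka consumer offsets
a1.channels.kch1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kch1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.channels.kch1.kafka.topic = flume-csv-staging
a1.channels.kch1.kafka.consumer.group.id = flume-csv-agent

With this channel the events survive as long as Kafka retains the topic, and the consumer offsets give you the commit-offset-like semantics the question asks about.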

To your first question, you can achieve this through the source/channel configuration you select.

For example, you can have multiple Spooling Directory sources (placing the CSVs in them) and configure a separate channel to read from each, as in the sketch below.
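
A rough sketch of that layout, assuming two landing directories (all names and paths are again placeholders):

# Two independent source -> channel -> sink pipelines in one agent,
# draining two spool directories in parallel
a1.sources = src1 src2
a1.channels = ch1 ch2
a1.sinks = k1 k2

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/incoming/csv-1
a1.sources.src1.channels = ch1

a1.sources.src2.type = spooldir
a1.sources.src2.spoolDir = /data/incoming/csv-2
a1.sources.src2.channels = ch2

# Each file channel needs its own checkpoint and data directories
a1.channels.ch1.type = file
a1.channels.ch1.checkpointDir = /var/lib/flume/cp1
a1.channels.ch1.dataDirs = /var/lib/flume/data1

a1.channels.ch2.type = file
a1.channels.ch2.checkpointDir = /var/lib/flume/cp2
a1.channels.ch2.dataDirs = /var/lib/flume/data2

# Both pipelines can land in the same HDFS directory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /landing/csv
a1.sinks.k1.channel = ch1

a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /landing/csv
a1.sinks.k2.channel = ch2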

Or you can have multiple topics (a KafkaChannel for each) receiving the CSV data, and then use KafkaSinks to publish them to a single destination topic, or HDFS sinks writing to a single directory; a sketch follows.
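
A sketch of the Kafka fan-in variant (the topic names and broker address are made up for illustration):

# Two KafkaChannels staging CSV events, drained by two KafkaSinks
# that publish into one destination topic
a1.channels = kch1 kch2
a1.sinks = ks1 ks2

a1.channels.kch1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kch1.kafka.bootstrap.servers = broker1:9092
a1.channels.kch1.kafka.topic = csv-staging-1

a1.channels.kch2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kch2.kafka.bootstrap.servers = broker1:9092
a1.channels.kch2.kafka.topic = csv-staging-2

# Both sinks write to the same destination topic
a1.sinks.ks1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.ks1.kafka.bootstrap.servers = broker1:9092
a1.sinks.ks1.kafka.topic = csv-merged
a1.sinks.ks1.channel = kch1

a1.sinks.ks2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.ks2.kafka.bootstrap.servers = broker1:9092
a1.sinks.ks2.kafka.topic = csv-merged
a1.sinks.ks2.channel = kch2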

Or other strategies that you may come up with.


Re: HOW TO SCALE FLUME AGENT

New Contributor

Regarding the second suggestion, I think it is not safe, in the sense that you would mix multiple files together in Kafka and then save them to HDFS; in that case, you lose the file of origin. As for the first solution, I will put it to the test.

Thank you again.
