Support Questions

How to run Flume in HA?

I would like to know if there is a way to run Flume in HA (high availability) mode.

1 ACCEPTED SOLUTION

Here are some key points for running Flume in "HA":

1. Set up File Channels instead of Memory Channels (using a RAID array is very paranoid but possible) on any Flume agent in use.

2. Create a nanny process/script to watch for Flume agent failures and restart them immediately.

3. Put the Flume collector/aggregation/second-tier agents behind a network load balancer and use a VIP. This also has the benefit of balancing load under high ingest.

4. Optionally, have a sink that dumps to cycling files on the local drives (separate from the drive the File Channel operates on), in addition to a sink that forwards events on to the next Flume node or directly to HDFS. That way you have at least the time it takes to fill a drive to correct any major issues and recover lost ingest streams.

5. Use the built-in JMX counters in Flume to set up alerts in your favorite operations center application.
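To illustrate point 3, a first-tier agent can forward through the VIP with an Avro sink; alternatively, Flume's own load-balancing sink group spreads events across second-tier collectors without an external balancer. This is only a sketch: the host name, port, and component names below are placeholders for your environment.

```properties
# Option A: forward to the aggregation tier through a load-balancer VIP.
# flume-vip.example.com and port 4545 are placeholders.
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = flume-vip.example.com
a1.sinks.k1.port = 4545

# Option B: let Flume balance across two second-tier collectors itself.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true
```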
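The nanny process in point 2 can be as simple as a shell loop around `pgrep`. A minimal sketch, assuming the agent is launched via `flume-ng` with a config path we can match on; the agent name and paths are hypothetical and should be adjusted for your install.

```shell
#!/bin/sh
# Watchdog sketch for a Flume agent. AGENT_START is a hypothetical
# invocation -- substitute your real agent name and config path.

AGENT_PATTERN="flume-ng agent"                                     # command line to look for
AGENT_START="flume-ng agent -n a1 -f /etc/flume-ng/conf/a1.conf"   # hypothetical
CHECK_INTERVAL=10                                                  # seconds between checks

# Return 0 (success) if any process command line matches the pattern.
is_running() {
  pgrep -f "$1" > /dev/null 2>&1
}

# Loop forever, restarting the agent whenever it disappears.
watch_loop() {
  while :; do
    if ! is_running "$AGENT_PATTERN"; then
      echo "$(date): Flume agent not found, restarting" >&2
      $AGENT_START > /dev/null 2>&1 &
    fi
    sleep "$CHECK_INTERVAL"
  done
}

# Start watching only when invoked with "start", so the functions above
# can be sourced or tested without blocking.
if [ "${1:-}" = "start" ]; then
  watch_loop
fi
```

In production you would more likely run the agent under systemd, supervisord, or (as asked below) let Ambari manage it, but the loop above captures the idea.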


5 REPLIES

Master Collaborator

I don't think there is built-in HA in Flume. If you are worried about losing events because of a Flume agent going down, you can use the File Channel, which uses checkpointing. This ensures that no events are lost while the agent is down, and on restart it can resume sending events to the sink from where it left off.
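For reference, a File Channel might be configured like this. The agent name and directories are placeholders; the checkpoint and data directories should live on a local disk that persists across restarts.

```properties
# Hypothetical agent "a1"; paths are placeholders for your environment.
a1.channels = c1
a1.channels.c1.type = file
# Checkpoint and data dirs must survive agent restarts for recovery to work.
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
```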

If instead you are worried about the destination sink your agent writes to going down, you can use the Failover Sink Processor.
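As a sketch, a failover sink group with a primary and a standby sink looks like the following; the sink names and priority values are illustrative.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# Higher priority wins; k2 takes over only if k1 fails.
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Maximum backoff (ms) before a failed sink is retried.
a1.sinkgroups.g1.processor.maxpenalty = 10000
```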

Thanks @Deepesh. The File Channel would solve the data-loss problem, and the Failover Sink Processor addresses sink failure rather than Flume agent failure.

What if the Flume agent on a node gets killed and, as a result, no messages are passed to the sink? Wouldn't it be a good idea to have another Flume agent registered in ZooKeeper that periodically checks whether the first agent is alive, and, if it ever dies, starts piping the data to the sink itself?

Master Collaborator

It's hard to give a generic answer on how to achieve high availability without knowing the topology, the data, the form of ingestion, and where and how it is written at the destination. In many cases, if the data at the source is still available after the agent gets killed, then upon restarting the agent the checkpointing on the File Channel will let it recover from the point where it failed. Some topologies start multiple Flume agents for availability; of course that introduces data redundancy, but that is fine in some cases.


Explorer

Is there any complete step-by-step documentation somewhere for setting up Flume high availability using Ambari? Please let me know the link. Thanks once again.
