10-04-2015 09:19 PM
CDH5.2 installed with Cloudera Manager and Parcels
Are Flume channels isolated from each other? It seems that when one channel has a problem, the other channel is affected as well.
I want to record and process Syslog data with Flume using two channel+sink pairs (the channels are replicating) as follows:
When Spark Streaming is running, the above works well. Data is handled correctly by both sinks.
However, when the Spark Streaming job hung or stopped, avrochannel had network-related exceptions and ChannelFullException. This is understandable because the events could not be sent. The problem was that the amount of raw data logged by hdfschannel+hdfssink dropped to around 1-2% of the normal volume.
Is this expected? I don't understand why an error in an optional channel affects the others.
(Note: the use of File Channel is historical, but this does not seem to be the cause of the behaviour anyway?)
01-05-2016 07:27 PM
Replying to myself. I worked around this with Sink Groups and a Null Sink.
Relevant settings in flume.conf:
a1.sinks = hdfssink avrosink nullsink
a1.sinkgroups = avrosinkgroup
a1.sinkgroups.avrosinkgroup.sinks = avrosink nullsink
a1.sinkgroups.avrosinkgroup.processor.type = failover
a1.sinkgroups.avrosinkgroup.processor.priority.avrosink = 100
a1.sinkgroups.avrosinkgroup.processor.priority.nullsink = 10
a1.sinks.nullsink.type = null
a1.sinks.nullsink.channel = avrochannel
a1.sinks.nullsink.batchSize = 10000
The end result is that avrochannel normally uses the high-priority avrosink (priority=100). If this sink fails, it fails over to the low-priority nullsink, which simply discards the events.
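For anyone trying to reproduce this, here is a minimal sketch of how the whole agent might be wired together. Only the channel, sink and sink group names come from the settings above. The source name (syslogsrc), the source type, all hosts, ports and paths are hypothetical placeholders, and the original source and channel definitions are not shown in this thread, so treat this as an illustration under those assumptions and not as the actual configuration.

```properties
# Sketch only: syslogsrc, the hosts, ports and paths below are assumed placeholders.
a1.sources = syslogsrc
a1.channels = hdfschannel avrochannel
a1.sinks = hdfssink avrosink nullsink
a1.sinkgroups = avrosinkgroup

# The replicating selector copies every event to both channels.
a1.sources.syslogsrc.type = syslogtcp
a1.sources.syslogsrc.host = 0.0.0.0
a1.sources.syslogsrc.port = 5140
a1.sources.syslogsrc.channels = hdfschannel avrochannel
a1.sources.syslogsrc.selector.type = replicating
# avrochannel is marked optional, as described in the question.
a1.sources.syslogsrc.selector.optional = avrochannel

# Two File Channels, each with its own checkpoint and data directories.
a1.channels.hdfschannel.type = file
a1.channels.hdfschannel.checkpointDir = /var/lib/flume/hdfschannel/checkpoint
a1.channels.hdfschannel.dataDirs = /var/lib/flume/hdfschannel/data
a1.channels.avrochannel.type = file
a1.channels.avrochannel.checkpointDir = /var/lib/flume/avrochannel/checkpoint
a1.channels.avrochannel.dataDirs = /var/lib/flume/avrochannel/data

# Raw data goes to HDFS.
a1.sinks.hdfssink.type = hdfs
a1.sinks.hdfssink.channel = hdfschannel
a1.sinks.hdfssink.hdfs.path = /flume/syslog

# Events for Spark Streaming go out over Avro.
a1.sinks.avrosink.type = avro
a1.sinks.avrosink.channel = avrochannel
a1.sinks.avrosink.hostname = spark-receiver.example.com
a1.sinks.avrosink.port = 4545

# Fallback sink that drains avrochannel by discarding events.
a1.sinks.nullsink.type = null
a1.sinks.nullsink.channel = avrochannel
a1.sinks.nullsink.batchSize = 10000

# Failover group: avrosink is preferred, nullsink takes over when it fails.
a1.sinkgroups.avrosinkgroup.sinks = avrosink nullsink
a1.sinkgroups.avrosinkgroup.processor.type = failover
a1.sinkgroups.avrosinkgroup.processor.priority.avrosink = 100
a1.sinkgroups.avrosinkgroup.processor.priority.nullsink = 10
```

The idea behind the design is that avrochannel never fills up: whenever avrosink cannot deliver, nullsink keeps draining the channel, so the source is not held up by a full channel. The trade-off is that events are dropped on the Spark side while avrosink is down.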