New Contributor
Posts: 1
Registered: ‎08-15-2016

Two Flume and One Hive Table (duplicate logs)

I have a Hadoop cluster.

I want to collect logs, and I use Flume (syslog source).

For HA, I run two instances of Flume and send every log to both instances.

I use the Hive sink (partitioned by the date field from the log).


How can I resolve the problem of duplicate logs?

What are the possible solutions, other than deduplicating afterwards or using Kafka?

Cloudera Employee
Posts: 277
Registered: ‎01-09-2014

Re: Two Flume and One Hive Table (duplicate logs)

Set up your HA so that each log is sent to either agent, but not both. If one agent fails, all traffic should go to the other agent. This could be done with a load balancer in front of the two Flume instances.
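If there is a first-tier Flume agent in front of the two collectors, the same "one agent at a time, fail over on error" behavior can also be achieved within Flume itself using a failover sink processor, so no external load balancer is needed. A minimal sketch, assuming hypothetical collector hostnames `collector1`/`collector2` and Avro sources listening on port 4545:

```properties
# First-tier agent a1: one channel, two Avro sinks in a failover group.
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1

# Failover processor: events go only to the highest-priority live sink (k1);
# k2 takes over if k1 fails, so no event is delivered to both collectors.
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Primary collector (hostname is a placeholder).
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 4545

# Backup collector (hostname is a placeholder).
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 4545
```

With this layout, the two collector agents each run the Hive sink, but at any moment only one of them receives events, which avoids the duplicate rows in the Hive table.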