Created 04-12-2016 09:06 PM
We have 4 apps running Flume, and we are experiencing performance issues and running out of file descriptors. Each app runs 4 instances, spread across 16 data nodes. Their approximate volumes are:
App A - 60 GB per month
App B - 150 KB per month
App C - 54 GB per day
App D - 330 GB per day
We have been advised to move these onto dedicated hosts (4 hosts running 1 agent for each app = 4 per node). My questions are:
1. Is this a best practice for placement of Flume agents?
2. Will this cause downsides with data locality of the HDFS files that are written out?
Created 04-12-2016 10:24 PM
@jbarnett, (1) Yes, putting Flume on dedicated nodes is definitely the way to go. Both your Flume apps and your data nodes will benefit from it, and you can scale Flume independently of the rest of the cluster. (2) Again, yes, there is a downside regarding HDFS locality, but it's a small one compared to the gains obtained by (1), and it only concerns HDFS sinks. Once you start using, for example, Kafka, you will have Kafka sinks and no concerns of that kind.
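To put the volumes from the question in perspective when sizing dedicated hosts, here is a rough Python sketch that converts them into an average per-agent ingest rate. It assumes a 30-day month, decimal units (1 GB = 10^9 bytes), and an even split of each app's traffic across its 4 agents; none of these assumptions come from the thread, and the names `DAILY_BYTES` and `per_agent_bytes_per_second` are purely illustrative. These are averages only, so they say nothing about peak load or about the file descriptor problem.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30   # assumption: a month is treated as 30 days
AGENTS_PER_APP = 4    # from the question: 4 instances per app

# Volumes from the question, normalized to bytes per day.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 KB = 1e3 bytes).
DAILY_BYTES = {
    "App A": 60e9 / DAYS_PER_MONTH,   # 60 GB per month
    "App B": 150e3 / DAYS_PER_MONTH,  # 150 KB per month
    "App C": 54e9,                    # 54 GB per day
    "App D": 330e9,                   # 330 GB per day
}


def per_agent_bytes_per_second(daily_bytes, agents=AGENTS_PER_APP):
    """Average bytes/second one agent handles, assuming an even split."""
    return daily_bytes / SECONDS_PER_DAY / agents


def report():
    """Print the average per-agent rate for each app."""
    for app, daily in DAILY_BYTES.items():
        rate = per_agent_bytes_per_second(daily)
        print(f"{app}: {rate:,.2f} bytes/s per agent (average)")


if __name__ == "__main__":
    report()
```

Under these assumptions, Apps C and D account for nearly all of the traffic, so they are the ones to consider first when deciding how many dedicated hosts to provision.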