Best Practice for Flume placement - data nodes vs dedicated nodes

We have 4 apps using Flume, each running 4 agent instances spread across 16 data nodes. We are experiencing performance issues and running out of file descriptors. The approximate volumes are:

App A - 60 GB per month

App B - 150 KB per month

App C - 54 GB per day

App D - 330 GB per day

We have been advised to move these onto dedicated hosts (4 hosts, each running 1 agent per app, i.e. 4 agents per host). My questions are:

1. Is this a best practice for placement of Flume Agents?

2. Will this cause downsides for the data locality of the HDFS files that are written out?

1 ACCEPTED SOLUTION

@jbarnett, (1) Yes, putting Flume on dedicated nodes is definitely the way to go. Both your Flume apps and your DataNodes will benefit from it, and you can scale Flume independently of the rest of the cluster. (2) Again, yes, there is a downside regarding HDFS locality, but it is a small one in comparison to the gains obtained by (1), and it only concerns HDFS sinks. Once you start using, for example, Kafka, you will have Kafka sinks and no concerns of that kind.
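To put the volumes from the question in perspective, here is a rough back-of-the-envelope sketch in Python. It only converts the stated volumes into average rates. The 30-day month, the decimal units, and the even split across the 4 dedicated hosts are assumptions made for illustration and are not from the original post. Real traffic will be bursty, so peak rates can be well above these averages.

```python
# Rough sizing sketch; the volumes are taken from the question above.
# Assumptions (not from the original post): 30-day months, decimal units
# (1 GB = 10**9 bytes), and traffic split evenly across the dedicated hosts.

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30      # assumption
DEDICATED_HOSTS = 4      # from the proposed layout in the question

KB = 10**3
MB = 10**6
GB = 10**9

# Average bytes per day for each app.
daily_bytes = {
    "App A": 60 * GB / DAYS_PER_MONTH,   # 60 GB per month
    "App B": 150 * KB / DAYS_PER_MONTH,  # 150 KB per month
    "App C": 54 * GB,                    # 54 GB per day
    "App D": 330 * GB,                   # 330 GB per day
}


def summarize(volumes, hosts):
    """Return aggregate and per-host average rates for the given daily volumes."""
    total = sum(volumes.values())
    rate = total / SECONDS_PER_DAY  # average bytes per second
    return {
        "total_gb_per_day": total / GB,
        "total_mb_per_s": rate / MB,
        "per_host_mb_per_s": rate / hosts / MB,
        "largest_app": max(volumes, key=volumes.get),
    }


summary = summarize(daily_bytes, DEDICATED_HOSTS)

for app, volume in daily_bytes.items():
    print(f"{app}: {volume / GB:.6f} GB/day")
print(f"Total: {summary['total_gb_per_day']:.1f} GB/day, "
      f"{summary['total_mb_per_s']:.2f} MB/s on average")
print(f"Per dedicated host: {summary['per_host_mb_per_s']:.2f} MB/s on average")
```

Under these assumptions the four apps add up to roughly 386 GB per day, or about 4.5 MB/s on average in aggregate, with App D accounting for most of it. This says nothing about file descriptor usage, which needs to be measured on the agents themselves.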
