Created 05-06-2016 10:16 PM
I need to find out what the best practice is for running a set of Flume agents in production. All the answers I find dance around the issue. I am clear that when setting up in Ambari, you can create a number of config groups for Flume, and each agent's configuration needs to be concatenated into the flume.conf for that group. So, each agent runs one instance on each host associated with the config group.
At this point, you can see and restart individual agents through Ambari. However (and here's the problem), if you change any agent's configuration or add a new agent, you need to restart ALL of the agents in that group for the change to take effect! That is not acceptable in my case, where I have 4 apps running 2 or 3 agents each. It certainly does not seem acceptable to have to restart every application's Flume agents whenever a change is made!
So, am I missing something or are large enterprises simply using shell scripts to start the agent on each host?
If they are using scripts, then what is being used for monitoring and auto-restart?
Created 07-26-2016 05:46 PM
We manage our Flume agents in Ambari. Of our many nodes, 3 are 'data-ingres' nodes. These nodes are bundled in a config group named 'dataLoaders', which is located at the top of Ambari > Flume > config.
The default flume.conf is empty; for the config group 'dataLoaders' we override the default and add 2 agents:
Each host in the config group runs the 2 agents, which can be restarted separately from the Ambari Flume summary page. When you change the config, the change is traceable/audited in Ambari. A restart from Ambari places the new config file for the Flume agents. The Ambari agent on the Flume host checks whether the process is running and alerts you when it's dead. Ambari also helps when upgrading the stack to the latest version(s).
notes:
Created 05-06-2016 11:56 PM
I say not Flume 🙂
Have you tried NiFi? You can have several processors for your app and configure each one of them with a few clicks in the GUI!! You want to re-configure a particular processor? No problem!! Stop it, right click, configure it and run it again.
If you really want to use Flume, I recommend using a config file per agent, as stated in the doc:
Hortonworks recommends that administrators use a separate configuration file for each Flume agent. .... While it is possible to use one large configuration file that specifies all the Flume components needed by all the agents, this is not typical of most production deployments.
Since you have several agents on the same host, Ambari is not an option.
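As a rough illustration of the one-file-per-agent layout recommended above, here is a minimal Python sketch that groups the lines of a combined flume.conf by agent name. It assumes plain `key = value` lines where the agent name is the text before the first dot of each key. The agent names `agentA` and `agentB` and their component names are hypothetical, not taken from anyone's real setup.

```python
from collections import defaultdict


def split_by_agent(conf_text):
    """Group the lines of a combined flume.conf by agent name.

    Assumes simple "key = value" lines; the agent name is taken to be
    the part of the key before the first dot.
    """
    per_agent = defaultdict(list)
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key = line.split("=", 1)[0].strip()
        agent = key.split(".", 1)[0]
        per_agent[agent].append(line)
    # One text blob per agent, ready to be written to its own file
    return {agent: "\n".join(lines) + "\n" for agent, lines in per_agent.items()}


# Hypothetical combined config, for illustration only
combined = """\
# two agents concatenated in one file
agentA.sources = src1
agentA.channels = ch1
agentA.sinks = sink1
agentB.sources = src1
agentB.channels = ch1
agentB.sinks = sink1
"""

for name, text in split_by_agent(combined).items():
    print(f"--- {name}.conf ---")
    print(text, end="")
```

Each resulting block could then be written to its own file, so that changing one agent's config does not touch the others.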
Use NiFi !!
Created 05-10-2016 03:11 PM
So, I AM trying to get the powers that be to switch over to NiFi, but in the meantime we have a short time frame to port what they have with as few changes as possible.
Under "Starting Flume", the document also shows starting Flume from the command line. In this scenario, you could put each agent in a separate config file. I am just wondering if this is how most large enterprises are running in production. And, if so, how they are monitoring them.
BTW, I had accidentally posted this as an answer, so I'm not sure if everyone saw it.
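For anyone weighing the command-line route, here is a minimal Python sketch of what "start each agent from its own config file and restart it if it dies" might look like. It only assumes the standard `flume-ng agent --conf ... --conf-file ... --name ...` invocation; the paths, agent names, and the `supervise` helper are hypothetical, and this is not a statement of how any particular enterprise runs Flume.

```python
import subprocess
import time


def flume_command(agent_name, conf_file, conf_dir="/etc/flume/conf"):
    """Build the flume-ng command line for one agent with its own config file.

    conf_dir and conf_file are hypothetical paths; adjust to your install.
    """
    return [
        "flume-ng", "agent",
        "--conf", conf_dir,
        "--conf-file", conf_file,
        "--name", agent_name,
    ]


def supervise(agents, poll_seconds=10):
    """Start every agent, then restart any agent whose process has exited.

    `agents` maps agent name -> config file path. Runs until interrupted.
    """
    procs = {name: subprocess.Popen(flume_command(name, path))
             for name, path in agents.items()}
    while True:
        time.sleep(poll_seconds)
        for name, proc in list(procs.items()):
            if proc.poll() is not None:  # process has exited
                print(f"{name} exited with code {proc.returncode}; restarting")
                procs[name] = subprocess.Popen(flume_command(name, agents[name]))


# Show the command that would be run for one hypothetical agent.
print(" ".join(flume_command("agentA", "/etc/flume/conf/agentA.conf")))
# supervise({"agentA": "/etc/flume/conf/agentA.conf"})  # not started here
```

With one process per agent, a single agent can be restarted after a config change without touching the rest. A service manager on the host could fill the same role as the `supervise` loop.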
Created 05-13-2016 09:15 PM
Hi Jim! 🙂
Our project is still around and getting bigger. We are using both Cloudera and Hortonworks and building more dataflows. With increased complexity, we are finding Ambari more and more inadequate compared to Cloudera's full-featured commercial counterpart, Cloudera Manager. For Flume, there are only six metrics, four basic config attributes, and one big textbox for pasting in the config file. I have to hand-edit flume-env.sh to change the agent heap allocation.
(With apologies to our hosts) While Hortonworks offers a goodie bag of the latest Apache applications, the primitive state of the management console is a deal-breaker. If Ambari cannot be improved soon, I strongly recommend you consider Cloudera (we are using the free version).
Created 05-14-2016 12:25 PM
The limitations in Flume management are absolutely there, but we make up for them with our NiFi support.
Created 08-03-2017 06:37 AM
This was back in 2016; nowadays I would go for NiFi (open source) or StreamSets (free to use, pay for support).
Flume is deprecated in Hortonworks now and will be removed in future 3.* releases: deprecations_HDP.