​Flume in Production - To Ambari or not to Ambari, that is the question!

Expert Contributor

I need to find out what the best practice is for running a set of Flume agents in production. All the answers I find dance around the issue. I understand that when setting up in Ambari, you can create a number of config groups for Flume, and the definitions of all agents in a group need to be concatenated into the flume.conf for that group. Each agent then runs one instance on each host associated with the config group.

At this point, you can see and restart individual agents through Ambari. However (and here's the problem), if you change the configuration of any agent or add a new one, you need to restart ALL of the agents in that group for the change to take effect! That is not acceptable in my case, where I have 4 apps running 2 or 3 agents each. It certainly does not seem acceptable to have to restart every application's Flume agents whenever a change is made!
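To make the "which agents actually changed?" question concrete, here is a minimal Python sketch that compares two versions of a concatenated flume.conf and lists only the agents whose properties differ. It is not an Ambari feature. It assumes simple `agent.key = value` lines (no line continuations), and the function names and sample agent names (`a1`, `a2`, `a3`) are made up for illustration.

```python
from collections import defaultdict


def group_by_agent(conf_text):
    """Group 'agent.key = value' property lines by their agent-name prefix."""
    agents = defaultdict(dict)
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        agent, _, rest = key.strip().partition(".")
        agents[agent][rest] = value.strip()
    return dict(agents)


def changed_agents(old_text, new_text):
    """Return the sorted names of agents that were added, removed, or modified."""
    old = group_by_agent(old_text)
    new = group_by_agent(new_text)
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )


# Hypothetical before/after versions of one config group's flume.conf.
OLD = "a1.sources = r1\na1.channels = c1\na2.sources = r2\n"
NEW = (
    "a1.sources = r1\na1.channels = c1\n"
    "a2.sources = r2\na2.sinks = k2\n"
    "a3.sources = r3\n"
)

print(changed_agents(OLD, NEW))  # ['a2', 'a3']
```

With a list like this, a script could restart only the affected agents, whether they are started by hand or from the Ambari summary page.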

So, am I missing something or are large enterprises simply using shell scripts to start the agent on each host?

If they are using scripts, then what is being used for monitoring and auto-restart?

1 ACCEPTED SOLUTION

Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

Contributor

We manage our Flume agents in Ambari. Among our many nodes, we have 3 'data-ingres' nodes. These nodes are bundled in a config group named 'dataLoaders', which is listed at the top of Ambari > Flume > config.

The default flume.conf is empty; for the config group 'dataLoaders' we override the default and add 2 agents:

  1. Pulling data from a queue and putting it in Kafka + HDFS.
  2. Receiving JSON and placing it on a Kafka topic.

Each host in the config group runs both agents, which can be restarted separately from the Ambari Flume summary page. Config changes are traceable and audited in Ambari. A restart from Ambari places the new config file for the Flume agents on the host. The Ambari agent on the Flume host checks whether the process is running and alerts you when it is dead. Ambari also helps when upgrading the stack to the latest version(s).

notes:

  • You cannot put a host in multiple config groups. (Don't mix responsibilities.)
  • The configuration is plain text with no validation at all. (Start the agent and check /var/log/flume/**.log.)
  • Rolling restart for a config group is not supported. (Restart the Flume agents one by one.)
  • Ambari 'alive' checks are very simple: a locked-up agent counts as running even though it is not working.
  • Ambari's Flume data insight charts are too simple. (Grafana is coming, or use JMXExporter -> Prometheus.)
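The "running but not working" case in the notes above can be caught with a check that goes beyond process liveness. The Python sketch below assumes the agent was started with Flume's built-in JSON reporting (`-Dflume.monitoring.type=http` plus a `-Dflume.monitoring.port`), which serves counters at `/metrics`. It takes two snapshots and flags sinks whose `EventDrainSuccessCount` has not advanced while some channel still holds events. The function names and the URL you pass in are illustrative. The heuristic is rough: an idle sink attached to an empty channel is also flagged if a different channel has a backlog.

```python
import json
import urllib.request


def fetch_metrics(url):
    """Fetch one metrics snapshot, e.g. from http://<host>:<port>/metrics."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def stalled_sinks(before, after):
    """Return sinks that drained nothing between snapshots despite a backlog."""
    # Flume reports counter values as strings, so convert before comparing.
    backlog = any(
        int(metrics.get("ChannelSize", 0)) > 0
        for name, metrics in after.items()
        if name.startswith("CHANNEL.")
    )
    if not backlog:
        return []
    stalled = []
    for name, metrics in after.items():
        if not name.startswith("SINK."):
            continue
        previous = int(before.get(name, {}).get("EventDrainSuccessCount", 0))
        current = int(metrics.get("EventDrainSuccessCount", 0))
        if current <= previous:
            stalled.append(name)
    return sorted(stalled)
```

A cron job or monitoring agent could call `fetch_metrics` twice, a minute or so apart, and raise an alert when `stalled_sinks` returns a non-empty list.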


Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

@jbarnett

I say not Flume :)

Have you tried NiFi? You can have several processors for your app and configure each of them with a few clicks in the GUI! If you want to reconfigure a particular processor, no problem: stop it, right-click, configure it, and run it again.

If you really want to use Flume, I recommend using a config file per agent, as stated in the doc:

Hortonworks recommends that administrators use a separate configuration file for each Flume agent. .... While it is possible to use one large configuration file that specifies all the Flume components needed by all the agents, this is not typical of most production deployments. 

Since you have several agents on the same host, Ambari is not an option.

Use NiFi !!

Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

Expert Contributor

@Abdelkrim Hadjidj

So, I AM trying to get the powers that be to switch over to NiFi, but in the meantime we have a short time frame to port what they have with as few changes as possible.

Under "Starting Flume", the document also shows starting Flume from the command line. In this scenario, you could put each agent in a separate config file. I am just wondering if this is how most large enterprises are running in production and, if so, how they are monitoring the agents.
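As one possible shape for the command-line approach, here is a minimal Python sketch that builds a `flume-ng agent` command for each agent's own config file and restarts any agent whose process has exited. It is a bare-bones stand-in for a real supervisor (systemd, supervisord, and so on), not a recommendation of what enterprises actually run. The paths and agent names are hypothetical, and the `popen` and `sleep` parameters exist only so the loop can be exercised without starting real processes.

```python
import subprocess
import time


def flume_command(agent_name, conf_dir, conf_file):
    """Build the flume-ng command line for one agent with its own config file."""
    return [
        "flume-ng", "agent",
        "--conf", str(conf_dir),
        "--conf-file", str(conf_file),
        "--name", agent_name,
    ]


def supervise(commands, poll_seconds=10, max_rounds=None,
              popen=subprocess.Popen, sleep=time.sleep):
    """Start every agent, then restart any whose process has exited.

    commands maps agent name -> command list. Returns restart counts per
    agent once max_rounds polls have run (loops forever if max_rounds is None).
    """
    procs = {name: popen(cmd) for name, cmd in commands.items()}
    restarts = {name: 0 for name in commands}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        sleep(poll_seconds)
        for name, proc in list(procs.items()):
            # poll() returns None while the process is still running.
            if proc.poll() is not None:
                procs[name] = popen(commands[name])
                restarts[name] += 1
        rounds += 1
    return restarts
```

Usage might look like `supervise({"a1": flume_command("a1", "/etc/flume/conf", "/etc/flume/conf/a1.conf")})`. Because each agent has its own file and process, changing one agent's config only requires restarting that agent. Note that this only detects exited processes, not agents that are alive but stuck.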

BTW, I had accidentally posted this as an answer, so I am not sure if everyone saw it.

Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

Contributor

Hi Jim! :-)

Our project is still around and getting bigger. We are using both Cloudera and Hortonworks and building more dataflows. With increased complexity, we are finding Ambari more and more inadequate compared to Cloudera's full-featured commercial counterpart, Cloudera Manager. For Flume, there are only six metrics, four basic config attributes, and one big textbox for pasting in the config file. I have to hand-edit flume-env.sh to change the agent heap allocation.

(With apologies to our hosts.) While Hortonworks offers a goodie bag of the latest Apache applications, the primitive state of the management console is a deal-breaker. If Ambari cannot be improved soon, I strongly recommend you consider Cloudera (we are using the free version).

Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

Mentor

The limitations of Flume management are absolutely there, but we make up for them with our NiFi support.

Re: ​Flume in Production - To Ambari or not to Ambari, that is the question!

Contributor

This was back in 2016; nowadays I would go for NiFi (open source) or StreamSets (free to use, pay for support).

Flume is deprecated in Hortonworks now and will be removed in future 3.* releases: deprecations_HDP.
