
Flume with Oozie

Expert Contributor

We are using Flume to get data into HDFS. After that we run Pig and Hive for data transformation. How can we trigger Flume from Oozie?

1 ACCEPTED SOLUTION

Super Collaborator

@vamsi valiveti you can trigger Flume from the Oozie shell action. However, be aware that the action will be executed on a random cluster node, so all your nodes would need Flume installed. You would also need some way to control the agents afterwards, and with more than 10 nodes that becomes a problem. That is why this is not a common way to use Flume.

I'd say the better approach is to keep Flume running all the time, and to schedule Oozie jobs to process the data whenever you need.



Super Collaborator

Hi @vamsi valiveti,

Oozie is a scheduler, while Flume does not work on a schedule; it processes data as it arrives. So you use the Flume configuration to say, for example, that each time a file appears in a certain directory, Flume should put it into HDFS (if you use the spooldir source), and so on.
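For reference, a minimal spooldir-to-HDFS configuration along those lines might look roughly like this (the agent, directory, and HDFS path names are illustrative, not from this thread):

```properties
# Illustrative agent/source/channel/sink names
agent1.sources = spool-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Watch a local directory; each file dropped there is ingested once
agent1.sources.spool-src.type = spooldir
agent1.sources.spool-src.spoolDir = /data/incoming
agent1.sources.spool-src.channels = mem-ch

agent1.channels.mem-ch.type = memory

# Write the events to HDFS, bucketed by date
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /landing/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel = mem-ch
```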

/Best regards, Mats

Expert Contributor

a) I am starting the Flume agent with the command below. In production, how do we trigger this command? Currently I run it manually at the Unix command prompt. I also want to create a dependency with Hive.

b) Can I place the command below in a Unix shell script and call it from a shell action in Oozie?

flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name Agent7
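Wrapped as a script (for example for a shell action), the command above might look like the sketch below. The agent name and config paths come from this thread; the default path and the PATH check are assumptions:

```shell
#!/bin/sh
# start-flume-agent.sh - hypothetical wrapper around the flume-ng command
# above, e.g. for an Oozie shell action.

FLUME_CONF_DIR="${FLUME_CONF_DIR:-/etc/flume/conf}"  # assumed default location
AGENT_NAME="${1:-Agent7}"                            # agent name from the thread

# A shell action can land on any cluster node, so check that Flume is
# actually installed there before trying to start the agent.
if command -v flume-ng >/dev/null 2>&1; then
    flume-ng agent \
        --conf "$FLUME_CONF_DIR" \
        --conf-file "$FLUME_CONF_DIR/flume.conf" \
        --name "$AGENT_NAME"
else
    echo "flume-ng not found on $(hostname); install Flume on all nodes" >&2
fi
```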

Expert Contributor

Hi @Mats Johansson,

Any input on my clarifications?


Expert Contributor

Hi @Michael M,

Thanks a lot for your time. One small clarification:

You mentioned that the good approach is to keep Flume running all the time and to schedule Oozie jobs to process the data whenever needed.

Clarification 1:

How do I keep Flume running all the time? Currently I am using the command below on my gateway node.

flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name Agent7

Super Collaborator

@vamsi valiveti the easiest way is to detach the command from the shell using nohup:

nohup <my_command> &

Another option is to create a Flume init.d service script and run Flume as a service; I've posted an example script here (search for "Setup flume agent auto startup" on the page).

A third option is to use Ambari to control the agents.
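A stripped-down sketch of what such an init.d-style control script could contain is shown below. The paths, agent name, and pidfile location are assumptions, and this is not the script from the linked post:

```shell
#!/bin/sh
# flume-agent - minimal init.d-style control sketch for a Flume agent.
# Paths, the agent name, and the pidfile location are illustrative.

AGENT_NAME="Agent7"
FLUME_CONF_DIR="${FLUME_CONF_DIR:-/etc/flume/conf}"
PIDFILE="${PIDFILE:-/var/run/flume-$AGENT_NAME.pid}"
LOGFILE="${LOGFILE:-/var/log/flume/$AGENT_NAME.log}"

status() {
    # report "running" only if the pidfile exists and the process is alive
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "running"
    else
        echo "stopped"
    fi
}

start() {
    [ "$(status)" = "running" ] && { echo "already running"; return; }
    nohup flume-ng agent --conf "$FLUME_CONF_DIR" \
        --conf-file "$FLUME_CONF_DIR/flume.conf" \
        --name "$AGENT_NAME" >> "$LOGFILE" 2>&1 &
    echo $! > "$PIDFILE"    # remember the agent's pid for stop/status
}

stop() {
    [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" 2>/dev/null
    rm -f "$PIDFILE"
}

case "$1" in
    start)  start ;;
    stop)   stop ;;
    status) status ;;
    *)      echo "usage: $0 {start|stop|status}" ;;
esac
```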

Expert Contributor

Hi @Michael M,

For the first option:

In production, can I place the command below in a shell script and schedule that script with crontab, so that Flume runs continuously? In our production environment we are not allowed to run commands manually on the gateway node. Please correct me if I am wrong.

nohup <my_command> &
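A cron-driven script needs a guard so that it does not start a second agent every time it fires. A minimal sketch of such a watchdog, with an example crontab entry in the comments, might look like this (the script path, agent name, and log location are all illustrative assumptions):

```shell
#!/bin/sh
# flume-watchdog.sh - hypothetical guard for cron: start the agent only
# if it is not already running, so it is safe to run every few minutes.
#
# Example crontab entry (every 5 minutes; path is illustrative):
#   */5 * * * * /opt/scripts/flume-watchdog.sh >> /var/log/flume/watchdog.log 2>&1

AGENT_NAME="Agent7"

agent_running() {
    # true (exit 0) if a flume-ng agent with this name is already running
    pgrep -f "flume-ng agent.*--name $1" >/dev/null 2>&1
}

if agent_running "$AGENT_NAME"; then
    echo "$AGENT_NAME already running; nothing to do"
else
    echo "$AGENT_NAME not running; starting it"
    nohup flume-ng agent --conf "$FLUME_CONF_DIR" \
        --conf-file "$FLUME_CONF_DIR/flume.conf" \
        --name "$AGENT_NAME" >> "$HOME/flume-$AGENT_NAME.log" 2>&1 &
fi
```

This also means cron restarts the agent automatically if it dies, which partly addresses the monitoring concern.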

Super Collaborator

@vamsi valiveti yes, that could be an option.

But for production use I would also think about how to stop the agents and how to monitor them. In my experience, an init.d service script plus Ganglia monitoring is the best option.

It allows you to start and stop agents easily with commands like /etc/init.d/flume "agent" stop/start, and Ganglia provides a nice web interface for monitoring.