Created 12-27-2016 11:52 AM
We are using Flume to get data into HDFS. After that we run Pig and Hive for data transformation. I'm not sure how to trigger Flume from Oozie.
Created 12-27-2016 01:52 PM
Hi @vamsi valiveti,
Oozie is a scheduler, whereas Flume does not work on a schedule: it processes data as it arrives. So you use the Flume configuration to say, for example, that each time a file appears in a certain directory, Flume puts it into HDFS (if you use the spooldir source), and so on.
/Best regards, Mats
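To illustrate the spooldir setup described above, here is a minimal sketch of a helper that writes such a Flume configuration. The agent name `agent1`, the component names, the function name `write_spool_conf`, and both paths are assumptions made for illustration only; the property keys themselves are standard Flume settings for the spooling directory source, memory channel, and HDFS sink.

```shell
# Sketch: write a minimal spooldir -> HDFS Flume configuration to a file.
# "agent1", the component names and both paths are hypothetical placeholders.
write_spool_conf() {
    cat > "$1" <<'EOF'
# One source, one channel, one sink for the agent named agent1
agent1.sources = spool-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Pick up every completed file dropped into the spool directory
agent1.sources.spool-src.type = spooldir
agent1.sources.spool-src.spoolDir = /var/spool/flume/incoming
agent1.sources.spool-src.channels = mem-ch

# Buffer events in memory between source and sink
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# Write events to a dated HDFS directory as plain data
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/flume/incoming/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel = mem-ch
EOF
}
```

Calling `write_spool_conf "$FLUME_CONF_DIR/flume.conf"` would produce the file. Note that the spooldir source expects files to be complete and unchanged once they land in the directory, so move finished files in rather than writing them in place.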
Created 12-27-2016 03:19 PM
a) I am starting the Flume agent using the command below. How would we trigger this command in production? Currently I run it manually from the Unix command prompt. I also want to create a dependency with Hive.
b) Can I place the command below in a Unix shell script and call it from a shell action in Oozie?
flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name Agent7
Created 12-29-2016 12:09 PM
Any input on my clarifications?
Created 12-29-2016 02:22 PM
@vamsi valiveti you can trigger Flume from an Oozie shell action. However, note that the action will be executed on an arbitrary cluster node, so all your nodes would need to have Flume installed. You will also need to control the agents somehow after that, and with more than 10 nodes that becomes a problem. That's why this is not a common way of using Flume.
I'd say a good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever you need.
Created 12-29-2016 02:28 PM
Hi @Michael M
Thanks a lot for your time. One small clarification.
You mentioned that a good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever needed.
Clarification 1:
How do I keep Flume running all the time? Currently I am using the command below on my gateway node.
flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name Agent7
Created 12-29-2016 02:40 PM
@vamsi valiveti the easiest way is to detach the command from the shell using nohup:
nohup <my_command> &
Another option is to create a Flume init.d service script and run Flume as a service. I've posted an example script here (search for "Setup flume agent auto startup" on the page).
Created 12-29-2016 03:07 PM
Hi @Michael M
For the first option:
In production, can I place the command below in a shell script and schedule that script using crontab so that Flume runs continuously? In our production environment we are not allowed to run any command manually on the gateway node. Please correct me if I am wrong.
nohup <my_command> &
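One thing to watch with the crontab idea: if cron simply runs the nohup command on every tick, each run starts another agent. A rough sketch of a guard, assuming a PID file is used to track the agent (the function name `ensure_running` and all file paths are hypothetical):

```shell
# Sketch: start the command only if the recorded PID is no longer alive.
# Usage: ensure_running PID_FILE LOG_FILE command [args...]
ensure_running() {
    pid_file=$1
    log_file=$2
    shift 2
    # kill -0 sends no signal; it only checks that the process exists
    if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
        return 0    # agent still running, nothing to do
    fi
    nohup "$@" >>"$log_file" 2>&1 &
    echo $! >"$pid_file"
}

# Example crontab entry (hypothetical wrapper script that calls
# ensure_running with the flume-ng command), checking every 5 minutes:
#   */5 * * * * /path/to/ensure_flume.sh
```

This is only a PID-file check, so it cannot tell a healthy agent from a hung one, and a stale PID that has been reused by another process would be mistaken for the agent.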
Created 12-29-2016 03:51 PM
@vamsi valiveti yes, that could be an option.
But for production usage I'd also think about how to stop the agents and how to monitor them. In my experience, an init.d service script plus Ganglia monitoring is the best option.
It lets you start and stop agents easily with commands like /etc/init.d/flume "agent" stop/start, and Ganglia provides a nice web interface for monitoring.
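As a rough idea of what the core of such a service script could look like, here is a simplified sketch for a single agent. It is not the script referred to above: the function names, the default paths, and the `AGENT_CMD` variable are assumptions for illustration, and a real init.d script would also need the distribution's init headers and would typically take the agent name as an argument.

```shell
# Sketch of start/stop/status handling for one Flume agent.
# All defaults below are hypothetical; override them for your environment.
AGENT_NAME="${AGENT_NAME:-Agent7}"
FLUME_CONF_DIR="${FLUME_CONF_DIR:-/etc/flume/conf}"
PID_FILE="${PID_FILE:-/var/run/flume-$AGENT_NAME.pid}"
LOG_FILE="${LOG_FILE:-/var/log/flume-$AGENT_NAME.log}"
AGENT_CMD="${AGENT_CMD:-flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name $AGENT_NAME}"

# True if the PID file exists and that process is alive
is_running() {
    [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null
}

service_ctl() {
    case "$1" in
        start)
            if is_running; then echo "already running"; return 0; fi
            # AGENT_CMD is word-split on purpose; paths with spaces are not handled
            nohup $AGENT_CMD >>"$LOG_FILE" 2>&1 &
            echo $! >"$PID_FILE"
            echo "started"
            ;;
        stop)
            if is_running; then
                kill "$(cat "$PID_FILE")"
                rm -f "$PID_FILE"
                echo "stopped"
            else
                echo "not running"
            fi
            ;;
        status)
            if is_running; then echo "running"; else echo "not running"; return 3; fi
            ;;
        *)
            echo "Usage: {start|stop|status}" >&2
            return 2
            ;;
    esac
}
```

A script built around this would end with something like `service_ctl "$1"`, so that it can be invoked with `start`, `stop`, or `status`. The return code 3 for a stopped service follows the usual LSB convention for `status`.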