Created 02-27-2017 09:27 PM
I'm looking for some tips on using the Oozie shell action on a Hadoop cluster.
For example, let's say I run a Python script on our edge node that executes some Sqoop and HDFS commands against the Hadoop cluster and also temporarily writes some local files on the edge node. If I use Oozie to schedule the Python script, all these commands will run on the Hadoop cluster, meaning all the Python libraries I use in my code would have to be present on the data nodes. Python would also need to write files to the data nodes' local file systems (not HDFS), which shouldn't be allowed. How do I tackle these issues? How are the permissions and security of shell commands handled by Oozie?
Any tips/suggestion/links would be greatly appreciated.
Created 02-27-2017 09:36 PM
For an example of running Python 2 and Python 3, please see my article https://community.hortonworks.com/articles/82967/apache-ambari-workflow-designer-view-for-apache-oo....
It covers a new workflow editing tool called Workflow Manager, but the same steps can be applied to writing pure XML workflows. The requirement here is that all Python libraries must be available on every NodeManager. If you're on a Kerberized cluster, Oozie will proxy the user's permissions to the user executing the Oozie process, so paying attention to permissions across the whole workflow life cycle is also important. For good measure, here's my article on the shell action alone.
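To make that concrete, here is a rough skeleton of a shell-action workflow that runs a Python script. The HDFS path, script name, and the ${jobTracker}/${nameNode} properties are just placeholders you would define in your own job.properties, and any libraries the script imports still have to be installed on every NodeManager (or shipped the same way with file/archive elements).

<workflow-app xmlns="uri:oozie:workflow:0.5" name="python-shell-wf">
    <start to="run-python"/>
    <action name="run-python">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Run the system Python on whichever NodeManager the container lands on -->
            <exec>python</exec>
            <argument>myscript.py</argument>
            <!-- Ship the script from HDFS into the container's working directory -->
            <file>${nameNode}/user/${wf:user()}/scripts/myscript.py#myscript.py</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Because the action lands on an arbitrary NodeManager, anything the script writes with a relative path goes into the YARN container's temporary working directory and is cleaned up when the container exits, so anything you need to keep should be written to HDFS.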
Let me know if you run into any problems.
Created 02-28-2017 11:09 PM
You typically use the Oozie shell action as a connector or a data provider between two other Oozie actions in your workflow. You can include the "capture-output" element in your shell action, enabling the next action to read the output of the shell command/script. If all you want to run in Oozie is your Python script, as seems to be the case here, then it's better to use cron and schedule your script to run on a particular node in the cluster. Alternatively, you can "port" your Python script to Oozie by creating a Sqoop action followed by FS actions to run the HDFS commands. Oozie offers many actions you can choose from to develop your apps on Hadoop. See the details here, in particular the workflow functional specification and the action extensions.
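As a rough illustration of the capture-output pattern (action names and the run_date key below are made up), the shell action prints key=value lines to stdout, and the following FS action picks the value up through the wf:actionData() EL function:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="shell-capture-wf">
    <start to="get-date"/>
    <!-- The shell command must print Java-properties-style key=value lines for capture-output -->
    <action name="get-date">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>run_date=2017-02-28</argument>
            <capture-output/>
        </shell>
        <ok to="use-output"/>
        <error to="fail"/>
    </action>
    <!-- The next action reads the captured value via wf:actionData('action-name')['key'] -->
    <action name="use-output">
        <fs>
            <mkdir path="${nameNode}/tmp/ingest/${wf:actionData('get-date')['run_date']}"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>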