Welcome back, folks. In this tutorial, I'm going to demonstrate how to easily import existing Spark workflows and execute them in WFM, as well as how to create your own Spark workflows. As of today, Apache Spark 2.x is not supported in the version of Apache Oozie bundled with HDP; there is community work on making Spark 2 run in Oozie, but it has not been released yet. I'm going to concentrate on Spark 1.6.3 today.
Luckily, for a Spark action in a Kerberos environment, I didn't need to add anything else (i.e., a credentials section).
The first thing I need is the dfs.nameservices property from HDFS:
Ambari > HDFS > Configs
I'm going to use that value for the nameNode variable.
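As a quick alternative to the Ambari UI, you can read the same property from the command line. This is a sketch that assumes you are on a cluster node with the HDFS client configured:

```
# Print the dfs.nameservices value from the client's hdfs-site.xml
hdfs getconf -confKey dfs.nameservices
```

The value it prints (for example, a logical name like `mycluster`) is what goes into the nameNode variable as `hdfs://<nameservice>`.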
Now I'm ready to import this workflow into WFM; for the details, please review one of my earlier tutorials.
I'm presented with a Spark action node.
Click on the spark-node and hit the gear icon to preview the properties.
Let's also review the arguments for input and output, as well as the ResourceManager and NameNode. Also notice the prepare step: we can choose to delete a directory if it exists.
We're going to leave everything as is.
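For reference, the Spark action node shown in the designer corresponds to workflow XML roughly like the following. This is a hedged sketch modeled on the Spark example bundled with Oozie; the paths, class name, and jar location are placeholders, not necessarily the exact values in my workflow:

```
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- delete the output directory if it already exists -->
            <delete path="${nameNode}/user/${wf:user()}/examples/output-data/spark"/>
        </prepare>
        <master>${master}</master>
        <name>Spark-FileCopy</name>
        <class>org.apache.oozie.example.SparkFileCopy</class>
        <jar>${nameNode}/user/${wf:user()}/examples/apps/spark/lib/oozie-examples.jar</jar>
        <arg>${nameNode}/user/${wf:user()}/examples/input-data/text/data.txt</arg>
        <arg>${nameNode}/user/${wf:user()}/examples/output-data/spark</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The prepare block is what WFM exposes as the "delete a directory if it exists" option.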
When we submit the workflow, we're going to supply the nameNode and resourceManager addresses; below are my properties.
Notice that jobTracker and resourceManager both appear. Ignore jobTracker; it was inherited from the original workflow, and we're concerned with the ResourceManager going forward. Also, the nameNode value is the dfs.nameservices property from hdfs-site.xml, as I stated earlier.
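As a rough sketch, the submission properties look like this; `mycluster` and the ResourceManager host here are placeholders for your own dfs.nameservices value and RM address, not values from my cluster:

```
nameNode=hdfs://mycluster
resourceManager=your-rm-host.example.com:8050
oozie.use.system.libpath=true
```

Setting oozie.use.system.libpath=true is what lets the action pick up the Spark libraries from the Oozie sharelib.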
Once the job completes, you can navigate to the output directory and see that the file was copied.
In my case, the sample input was a book in the examples directory:
hdfs dfs -cat /user/aervits/examples/output-data/spark/part-00000
To be or not to be, that is the question;
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing, end them. To die, to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Next, let's add a Spark action in WFM and edit it. Fill out the properties as below and make sure to select Yarn Cluster; Yarn Client mode in Oozie will be deprecated soon. Notice you can pass each Spark option on its own line.
I also need to add an argument to the SparkPi job; in this case it's 10.
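For context, SparkPi estimates pi by Monte Carlo sampling, and the argument (10 here) is the number of partitions to spread the sampling across; more partitions means more samples and a better estimate. Here is a minimal single-machine Python sketch of the same estimation, with no Spark involved; the 100,000-samples-per-partition figure mirrors the constant used in the SparkPi example:

```python
import random

def estimate_pi(partitions: int, samples_per_partition: int = 100_000) -> float:
    """Monte Carlo pi estimate: sample points in the unit square and count
    how many fall inside the inscribed circle, as SparkPi does per partition."""
    total = partitions * samples_per_partition
    inside = 0
    for _ in range(total):
        x = random.random() * 2 - 1
        y = random.random() * 2 - 1
        if x * x + y * y <= 1:
            inside += 1
    return 4.0 * inside / total

print(estimate_pi(10))  # roughly 3.14
```

SparkPi does the same thing, except each partition's sampling runs as a separate task on the cluster and the counts are summed with a reduce.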
If you haven't figured it out already, I'm trying to recreate the following command in Oozie: