Created 04-01-2016 08:25 PM
I am new to Oozie. I am using Hue 2.6.1-2950 and Oozie 4.2. I developed a Spark program in Java which reads data from a Kafka topic and saves it in a Hive table. I pass my arguments to my .ksh script to submit the job. It works perfectly; however, I have no idea how to schedule this with Oozie and Hue to run every 5 minutes. I have a jar file containing my Java code, and a consumer.ksh script that reads the arguments from my configuration file and runs the jar using the spark-submit command. Please give me suggestions on how to do this.
Created 04-01-2016 08:48 PM
Hello Hoda,
There are essentially three ways: Spark action, ssh action, and shell action.
1) There is a Spark action for Oozie, but it is new and not yet supported by HDP, so you would need to install it.
https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
Another problem is that Hue does not support the Spark action, so you would need to kick off the workflow manually. (You can still monitor, start, stop, etc. the coordinator and action in Hue, but you couldn't use the Hue editor to create it.)
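To make option 1 concrete, here is a minimal sketch of a workflow using the Spark action, assuming the extension from the link above is installed. The app name, class, jar path, and argument are illustrative placeholders, not the poster's actual setup:

```xml
<!-- Sketch only: class, jar path, and names are hypothetical placeholders. -->
<workflow-app name="spark-kafka-to-hive" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>KafkaToHive</name>
            <class>com.example.KafkaToHive</class>
            <jar>${nameNode}/apps/consumer/lib/consumer-assembly.jar</jar>
            <arg>${kafkaTopic}</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```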
To be honest, I prefer to develop the workflow.xml and coordinator.xml in Eclipse (or any XML editor) and then kick them off using the Oozie command line. Creating a coordinator in the Hue web interface is torture; changing the XML is much easier. Hue is, however, great for monitoring and interaction.
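Since the question asks for a 5-minute schedule, here is a minimal coordinator sketch that triggers a workflow every 5 minutes. The app path, dates, and names are placeholders you would adapt:

```xml
<!-- Sketch only: start/end dates and the app-path are placeholders. -->
<coordinator-app name="consumer-coord" frequency="${coord:minutes(5)}"
                 start="2016-04-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/apps/consumer</app-path>
        </workflow>
    </action>
</coordinator-app>
```

You would then submit it from the command line with something like `oozie job -config job.properties -run`, where job.properties points at the coordinator on HDFS.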
There are also the shell and ssh actions in Oozie.
2) ssh action: you keep the same environment you currently have, which might be the easiest way forward.
This essentially means that Oozie SSHes into your Spark client and runs any command you want. You can specify parameters, which are passed to the ssh command, and you can return results from your ksh script by echoing something like
echo result=SUCCESS (you can then pick that up in Oozie using capture-output if you need it)
https://oozie.apache.org/docs/3.2.0-incubating/DG_SshActionExtension.html
The downsides are that you have a single point of failure, and you need to set up passwordless ssh login from Oozie to your user account (essentially running ssh-keygen and then adding the public key of the oozie user from the Oozie server to the authorized_keys file of the Spark client account).
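An ssh action node for this could look roughly like the sketch below; the host, user, and script path are hypothetical placeholders for the poster's existing consumer.ksh setup:

```xml
<!-- Sketch only: host, user, and paths are placeholders. -->
<action name="ssh-node">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>sparkuser@spark-client.example.com</host>
        <command>/home/sparkuser/consumer.ksh</command>
        <args>${kafkaTopic}</args>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

With `<capture-output/>` in place and the script echoing `result=SUCCESS`, downstream nodes can read the value via `${wf:actionData('ssh-node')['result']}`.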
3) shell action
https://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html
This is a bit cleaner but more complex than the ssh action. It is very similar, but the shell action is executed on some datanode, and you don't know which one in advance, so you need to ship everything with it. You may have to add the Spark jars you need to the action execution using the <files> tag, so you should definitely build an assembly for your app and add the Spark assembly as well.
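A shell action node along these lines might look like the following sketch; the `<file>` tags ship the script and the assembly jar to whichever node runs the action. All paths and names are illustrative assumptions:

```xml
<!-- Sketch only: HDFS paths and file names are placeholders. -->
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>consumer.ksh</exec>
        <argument>${kafkaTopic}</argument>
        <file>${nameNode}/apps/consumer/consumer.ksh#consumer.ksh</file>
        <file>${nameNode}/apps/consumer/consumer-assembly.jar#consumer-assembly.jar</file>
        <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```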
Created 04-01-2016 08:37 PM
Hue doesn't have a Spark action to schedule.
Created 04-01-2016 08:48 PM
My Hue version has spark-submit. So there is no way to do it in Hue 2.6? @Divakar Annapureddy
Created 04-01-2016 08:43 PM
I think you need to put all the class paths in a shell script and create a shell action.
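A hypothetical sketch of such a wrapper script: it reads key=value pairs from a config file (mirroring the consumer.ksh setup described in the question) and assembles the spark-submit command. All file names, the class name, and the paths are illustrative assumptions; the command is echoed rather than executed so the sketch runs without a Spark installation:

```shell
#!/bin/sh
# Hypothetical consumer.ksh-style wrapper. Names and paths are placeholders.

CONF=./consumer.conf

# Sample config, created here so the sketch is self-contained.
cat > "$CONF" <<'EOF'
TOPIC=events
BROKERS=broker1:9092
APP_JAR=/opt/app/consumer-assembly.jar
EOF

# Source the config to pick up TOPIC, BROKERS, and APP_JAR.
. "$CONF"

# Assemble the spark-submit invocation from the config values.
CMD="spark-submit --master yarn-cluster --class com.example.KafkaToHive $APP_JAR $TOPIC $BROKERS"
echo "$CMD"
```

In a real script you would run `$CMD` instead of echoing it, and the config file would already exist instead of being written inline.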
Created 04-07-2016 03:56 PM
@hoda moradi Slightly unrelated, but the following might be useful: I wrote a sample application that does some parsing and processing in Spark.
https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
Created 04-07-2016 05:07 PM
Thanks, it is really helpful for beginners like me.
Created 04-07-2016 06:25 PM
Hoda, just a comment: please post your questions in the proper tracks. Your questions are always directed to the Community track, but judging by your inquiries they fall under Spark, streaming, and data processing. The Community track only applies to HCC-specific questions. You will also get better responses faster.
Created 12-22-2016 06:14 PM
Hi, I am running a spark-submit command with an Oozie workflow, but I am getting the error Main class [org.apache.oozie.action.hadoop], exit code [1].
I just wanted to confirm whether I need to give the HDFS paths of the jar and keytab in spark-submit.
Thanks in advance!!