
Scheduling a spark-submit job using Oozie

Expert Contributor

I am new to Oozie. I am using Hue 2.6.1-2950 and Oozie 4.2. I developed a Spark program in Java which reads data from a Kafka topic and saves it to a Hive table. I pass my arguments to my .ksh script to submit the job. It works perfectly; however, I have no idea how to schedule this with Oozie and Hue so it runs every 5 minutes. I have a jar file containing my Java code, and a consumer.ksh which reads the arguments from my configuration file and runs my jar using the spark-submit command. Please give me suggestions on how to do this.

1 ACCEPTED SOLUTION

Master Guru

Hello Hoda,

there are essentially three ways: the Spark action, the ssh action, and the shell action.

1) There is a Spark action for Oozie, but it's new and not yet supported by HDP, so you would need to install it yourself.

https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
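If you do install the extension, a minimal workflow.xml with a Spark action node might look like the sketch below. This is untested; the application name, main class, jar path, and property names are hypothetical placeholders, not taken from the original post:

```xml
<workflow-app name="spark-kafka-to-hive" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>KafkaToHive</name>
            <!-- hypothetical main class and jar location -->
            <class>com.example.KafkaToHive</class>
            <jar>${nameNode}/apps/myapp/lib/myapp.jar</jar>
            <spark-opts>--executor-memory 2G --num-executors 2</spark-opts>
            <arg>${kafkaTopic}</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>
```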

Another problem is that Hue does not support the Spark action, so you would need to kick off the workflow manually. (You can still monitor, start, and stop the coordinator and action in Hue, but you couldn't use the Hue editor to create it.)

To be honest, I prefer to develop the workflow.xml and coordinator.xml in Eclipse (or any XML editor) and then kick them off using the Oozie command line. Creating a coordinator in the Hue web interface is torture; editing the XML directly is much easier. Hue is great for monitoring and interaction, though.
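For the every-5-minutes requirement, a hand-written coordinator.xml might look like this sketch (dates, paths, and the Oozie URL are placeholders you would substitute for your environment):

```xml
<coordinator-app name="spark-every-5-min"
                 frequency="${coord:minutes(5)}"
                 start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- hypothetical HDFS location of the workflow above -->
            <app-path>${nameNode}/apps/myapp/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>
```

You would then submit it from the command line with something like `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run`, where job.properties defines `nameNode`, `jobTracker`, and `oozie.coord.application.path`.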

There is also the way to run a shell or ssh action in oozie.

2) The ssh action means that you keep the same environment you currently have. This might be the easiest way forward.

This essentially means that Oozie logs into your Spark client machine via ssh and runs any command you want. You can specify parameters, which are passed to the ssh command, and you can return results from your ksh script by printing something like

echo result=SUCCESS (you can then read that value in Oozie using capture-output if you need it)

https://oozie.apache.org/docs/3.2.0-incubating/DG_SshActionExtension.html
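A sketch of what an ssh action node could look like for this setup; the host, user, and script paths are hypothetical:

```xml
<action name="ssh-spark-submit">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <!-- hypothetical edge node where the Spark client lives -->
        <host>sparkuser@edge-node.example.com</host>
        <command>/home/sparkuser/consumer.ksh</command>
        <args>/home/sparkuser/consumer.conf</args>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

With `<capture-output/>`, a `result=SUCCESS` line printed by the script can be read later in the workflow via `${wf:actionData('ssh-spark-submit')['result']}`.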

The bad thing here is that you have a single point of failure, and you need to set up passwordless ssh login from Oozie to your user account (essentially running ssh-keygen and then adding the public key of the oozie user from the Oozie server to the authorized_keys file of the Spark client account).

3) shell action

https://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html

This is a bit cleaner but more complex than the ssh action. It's very similar, but the shell action is executed on an arbitrary datanode; you don't know which one in advance, and you need to ship everything along with it. You may have to add the Spark jars you need to the action execution using the <file> tag, so you should definitely build an assembly jar for your app and add the Spark assembly as well.
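A sketch of a shell action that ships the script, config, and assembly jar to whichever node runs it; all HDFS paths here are hypothetical:

```xml
<action name="shell-spark-submit">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>consumer.ksh</exec>
        <argument>consumer.conf</argument>
        <!-- everything the script needs must be shipped with the action -->
        <file>${nameNode}/apps/myapp/consumer.ksh#consumer.ksh</file>
        <file>${nameNode}/apps/myapp/consumer.conf#consumer.conf</file>
        <file>${nameNode}/apps/myapp/lib/myapp-assembly.jar#myapp-assembly.jar</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The `#name` suffix on each `<file>` entry controls the local symlink name the script sees in its working directory.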




Hue doesn't have spark action to schedule.

Expert Contributor

My Hue version has spark-submit. So is there no way to do it in Hue 2.6? @Divakar Annapureddy


I think you need to put all the class paths in a shell script and create a shell action.
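A minimal sketch of what such a wrapper script might look like. To stay verifiable without a cluster it only assembles and echoes the spark-submit command rather than executing it; the jar names and main class are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch of a wrapper a shell action could run.
# Assembles the spark-submit command with the jars shipped via <file> tags.
# All names below are hypothetical placeholders.

APP_JAR="myapp-assembly.jar"          # application assembly shipped with the action
MAIN_CLASS="com.example.KafkaToHive"  # hypothetical main class
SPARK_JARS="spark-assembly.jar"       # extra Spark jars also shipped with the action

CMD="spark-submit --class $MAIN_CLASS --master yarn-cluster --jars $SPARK_JARS $APP_JAR"

# Echo instead of exec so the assembled command can be inspected;
# the real script would instead run: exec $CMD "$@"
echo "$CMD"
```

The real consumer.ksh would read these values from the configuration file passed as `$1` and then execute the command.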

Master Guru

@hoda moradi Slightly unrelated, but the following might be useful: I wrote a sample application that does some parsing and processing in Spark.

https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html

Expert Contributor

Thanks, it is really helpful for beginners like me.

Master Mentor

Hoda, just a comment: please place your questions in the proper tracks. Your questions always end up in the community track, but judging by your inquiries they belong under Spark, streaming, and data processing. The community track only applies to HCC-specific questions. You will also get better responses faster.

New Contributor

Hi, I am running a spark-submit command from an Oozie workflow, but I am getting the error: Main class [org.apache.oozie.action.hadoop], exit code [1]

I just wanted to confirm: do I need to give the HDFS paths of the jar and keytab in spark-submit?

Thanks in advance!!