Please follow the steps below to run a SparkR script via Oozie.


1. Install R packages on all NodeManager nodes

yum -y install R R-devel libcurl-devel openssl-devel
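If you have many NodeManager hosts, a simple SSH loop saves repetition. A minimal sketch, assuming passwordless SSH with sudo rights and a hypothetical nodes.txt file listing one hostname per line:

# Install the R packages on every NodeManager host listed in nodes.txt
while read host; do
    ssh "$host" "sudo yum -y install R R-devel libcurl-devel openssl-devel"
done < nodes.txt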


2. Keep your R script ready

Here is a sample script:

library(SparkR)

# Initialize the SparkR context and SQL context (Spark 1.x API, as shipped with HDP 2.5)
sc <- sparkR.init(appName = "SparkR-sample")
sqlContext <- sparkRSQL.init(sc)

# Build a local R data.frame and convert it to a Spark DataFrame
localDF <- data.frame(name = c("ABC", "blah", "blah"), age = c(39, 32, 81))
df <- createDataFrame(sqlContext, localDF)

# Print the DataFrame schema, then shut down the SparkR context
printSchema(df)
sparkR.stop()
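Before wiring the script into Oozie, it is worth running it once with spark-submit to confirm the R environment works. A quick check, assuming the script is saved locally as spark.R and SPARK_HOME points at the same Spark install used above:

# Run the script on YARN directly, outside of Oozie
$SPARK_HOME/bin/spark-submit --master yarn-client spark.R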


3. Create workflow.xml

Here is a working example:

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/usr/hdp/2.5.3.0-37/spark</value>
            </property>
            <property>
                <name>oozie.launcher.mapred.child.env</name>
                <value>SPARK_HOME=/usr/hdp/2.5.3.0-37/spark</value>
            </property>
        </configuration>
    </global>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <name>SparkR</name>
            <jar>${nameNode}/user/${wf:user()}/spark.R</jar>
            <spark-opts>--driver-memory 512m --conf spark.driver.extraJavaOptions=-Dhdp.version=2.5.3.0</spark-opts>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
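The ${jobTracker}, ${nameNode}, ${master} and ${examplesRoot} variables referenced above come from the job.properties file passed at submission time. A minimal sketch with hypothetical hostnames and ports; replace them with your cluster's values:

nameNode=hdfs://nn1.example.com:8020
jobTracker=rm1.example.com:8050
master=yarn-client
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}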


4. Make sure that you don't have sparkr.zip in the workflow's lib directory, in the Oozie sharelib, or in a <file> tag in the workflow; otherwise it will cause conflicts.
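To verify nothing conflicting is present, you can list the relevant locations. A quick check, assuming the workflow lives in your HDFS home directory and OOZIE_URL points at your Oozie server:

# Look for sparkr.zip in the workflow's lib directory (if one exists)
hdfs dfs -ls /user/$(whoami)/lib

# Look for sparkr.zip in the Spark sharelib
oozie admin -shareliblist spark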


Upload the workflow to HDFS and run it. This has been successfully tested on HDP-2.5.x and HDP-2.6.x.
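For reference, the upload and submission commands look roughly like this; a sketch assuming the files sit in your current directory, the HDFS paths match the job.properties above, and <oozie-host> is your Oozie server:

# Upload the workflow definition and the R script to HDFS
hdfs dfs -put workflow.xml spark.R /user/$(whoami)/

# Submit and start the workflow
oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run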


Please comment if you have any feedback, questions, or suggestions.

Happy Hadooping!! :)

Reference - https://developer.ibm.com/hadoop/2017/06/30/scheduling-spark-job-written-pyspark-sparkr-yarn-oozie
