Please follow the steps below to run a SparkR script via Oozie.


1. Install R packages on all NodeManager nodes

yum -y install R R-devel libcurl-devel openssl-devel
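
If you have many NodeManagers, you can push the install out from a single host. A minimal sketch, assuming passwordless SSH and a hypothetical nodemanagers.txt host list (Ambari or pssh work just as well):

# Install R and its build dependencies on every NodeManager listed in nodemanagers.txt
while read host; do
  ssh "$host" "yum -y install R R-devel libcurl-devel openssl-devel"
done < nodemanagers.txt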


2. Keep your R script ready

Here is a sample script (spark.R):

library(SparkR)

# Initialize the SparkR context and SQL context (Spark 1.6 API, as shipped with HDP 2.5)
sc <- sparkR.init(appName="SparkR-sample")
sqlContext <- sparkRSQL.init(sc)

# Build a local R data.frame and convert it into a Spark DataFrame
localDF <- data.frame(name=c("ABC", "blah", "blah"), age=c(39, 32, 81))
df <- createDataFrame(sqlContext, localDF)

# Print the inferred schema, then shut down the SparkR session
printSchema(df)
sparkR.stop()
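
Before scheduling the script through Oozie, you can sanity-check it by submitting it directly with spark-submit (the client path below assumes the standard HDP layout):

# Run the SparkR script directly to verify it works outside Oozie
/usr/hdp/current/spark-client/bin/spark-submit spark.R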


3. Create workflow.xml

Here is a working example:

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/usr/hdp/2.5.3.0-37/spark</value>
            </property>
            <property>
                <name>oozie.launcher.mapred.child.env</name>
                <value>SPARK_HOME=/usr/hdp/2.5.3.0-37/spark</value>
            </property>
        </configuration>
    </global>
    <start to='spark-node'/>
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <name>SparkR</name>
            <jar>${nameNode}/user/${wf:user()}/spark.R</jar>
            <spark-opts>--driver-memory 512m --conf spark.driver.extraJavaOptions=-Dhdp.version=2.5.3.0</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>
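
The workflow references ${jobTracker}, ${nameNode}, ${master}, and ${examplesRoot}, which come from job.properties. A minimal sketch, with placeholder hostnames and an assumed application directory (sparkr-demo) that you should adjust for your cluster:

# job.properties -- all hostnames and paths below are assumptions
nameNode=hdfs://<namenode-host>:8020
jobTracker=<resourcemanager-host>:8050
master=yarn-cluster
examplesRoot=examples
oozie.use.system.libpath=true
# HDFS directory that holds workflow.xml (hypothetical path)
oozie.wf.application.path=${nameNode}/user/${user.name}/sparkr-demo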


4. Make sure you don't have sparkr.zip in the workflow's lib directory, in the Oozie sharelib, or in a <file> tag in the workflow; otherwise it will cause conflicts. You can check the sharelib as shown below.
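
To verify that the Spark sharelib does not ship sparkr.zip, you can list it from the Oozie CLI (the Oozie URL is a placeholder):

# List the contents of the spark sharelib and look for sparkr.zip
oozie admin -oozie http://<oozie-host>:11000/oozie -shareliblist spark | grep -i sparkr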


Upload the workflow to HDFS and run it (see the sketch below). This has been successfully tested on HDP 2.5.x and HDP 2.6.x.
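
A minimal upload-and-run sequence, assuming the hypothetical paths used in the job.properties sketch above:

# Upload the R script and the workflow definition to HDFS
hdfs dfs -put spark.R /user/$(whoami)/spark.R
hdfs dfs -mkdir -p /user/$(whoami)/sparkr-demo
hdfs dfs -put workflow.xml /user/$(whoami)/sparkr-demo/

# Submit and start the workflow (Oozie URL is a placeholder)
oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run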


Please comment if you have any feedback/questions/suggestions.

Happy Hadooping!! :)

Reference - https://developer.ibm.com/hadoop/2017/06/30/scheduling-spark-job-written-pyspark-sparkr-yarn-oozie
