Created on 04-16-2016 09:52 AM - edited 09-16-2022 03:14 AM
Hello, I'm using CDH 5.5.1 with Spark 1.5.0.
I'm unsuccessfully trying to execute a simple Spark action (a Python script) via Oozie. For now I just want to be able to run anything at all, so the script is still a silly example that doesn't really do anything. It is as follows:
## IMPORT FUNCTIONS
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

## CREATE CONTEXTS (not pre-created when submitted outside the pyspark shell)
sc = SparkContext()
sqlContext = HiveContext(sc)

## CREATE MAIN DATAFRAME
eventi_DF = sqlContext.table("eventi")
I created a simple Oozie Workflow from Hue GUI. I used the following settings for the Spark action:
SPARK MASTER: yarn-cluster
MODE: cluster
APP NAME: MySpark
JARS / PY FILES: lib/test.py
MAIN CLASS: org.apache.spark.examples.mllib.JavaALS
ARGUMENTS: <No Arguments Defined>
I've uploaded the script to HDFS under the workspace directory "/user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib", and I'm sure it gets picked up: just figuring out that it was meant to go in this directory took a little work, fighting a "test.py not found" exception that no longer appears.
As of now, when I try to run the workflow by pressing the "Play" button in the GUI, this is what I get in the action log:
>>> Invoking Spark class now >>>
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, key not found: SPARK_HOME
java.util.NoSuchElementException: key not found: SPARK_HOME
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:58)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:58)
	at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:943)
	at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:942)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:942)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:630)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:124)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:914)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:973)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:185)
	at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:176)
	at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:49)
	at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:46)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:378)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:296)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
	at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file:
hdfs://mnethdp01.glistencube.com:8020/user/admin/oozie-oozi/0000000-160416120358569-oozie-oozi-W/spark-3ba6--spark/action-data.seq
Oozie Launcher ends
Now, I guess the problem is this line:
Failing Oozie Launcher, ... key not found: SPARK_HOME
Judging from the stack trace, Client.findPySparkArchives looks SPARK_HOME up in the environment of the process that invokes spark-submit (i.e. the Oozie launcher) in order to locate the PySpark archives, and the variable is apparently not set there.
I've been trying hard to set this SPARK_HOME key in different places. Things I've tried include the following:
- Spark Service Environment Advanced Configuration Snippet (Safety Valve):
  SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
- Oozie Service Environment Advanced Configuration Snippet (Safety Valve):
  SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
- Different master/mode combinations for the action:
  SPARK MASTER: local[*] / SPARK MODE: client
  SPARK MASTER: yarn-cluster / SPARK MODE: cluster
  SPARK MASTER: yarn-client / SPARK MODE: client
- Spark configuration properties:
  spark.yarn.appMasterEnv.SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
  spark.executorEnv.SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
All the above to no success; apparently I'm not able to set the required key anywhere it actually gets read (see the sketch below for how those last two Spark properties would normally be passed to the action).
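For reference, a minimal sketch of how Spark configuration keys like spark.yarn.appMasterEnv.SPARK_HOME are normally handed to an Oozie Spark action, namely through its <spark-opts> element, which is forwarded to spark-submit. The action name and transition targets here are illustrative, not my literal generated XML; master, mode, name and jar are the values from my settings above:

<action name="spark-MySpark">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>MySpark</name>
        <jar>lib/test.py</jar>
        <spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark --conf spark.executorEnv.SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</spark-opts>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
</action>

Note that these two keys only set the variable inside the launched application (the AM and the executors), not in the client process that throws the exception here, which would explain why they did not help.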
What am I doing wrong? Isn't this meant to be pretty straightforward? Thanks in advance for any insights.
Created 05-12-2016 05:34 PM
Created 05-13-2016 02:42 AM
Hi Ben, thanks a whole lot for your reply.
May I ask you where exactly you specified that setting?
- In the GUI, in some particular field?
- In "workflow.xml", in the Job's directory in HDFS? If yes: as an "arg", as a "property", or..?
- In "job.properties", in the Job's directory in HDFS? If yes: how?
- In some other file? E.g. "/etc/alternatives/spark-conf/spark-defaults.conf"? If yes, how?
A snippet of your code would be extremely appreciated!
I'm asking you because I've tried all of the above with your suggestion but I did not succeed.
Thanks again for your help
Created 05-13-2016 02:45 AM
Hi
either in your Hue Oozie workflow editor UI (workflow settings -> Hadoop Properties)
or in your workflow.xml:
<workflow-app name="Workflow name" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
            </property>
        </configuration>
    </global>
.....
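In case it is useful, the reason this works, as far as I can tell: the Oozie launcher is a one-mapper MapReduce job, and properties prefixed with oozie.launcher. are applied to that launcher job rather than to your action's own configuration. yarn.app.mapreduce.am.env sets environment variables on the launcher's application master, and (as the LocalContainerLauncher frames in your stack trace show) that is the same JVM that runs SparkMain and invokes spark-submit, so the SPARK_HOME lookup now succeeds. In the Hue editor (workflow settings -> Hadoop Properties) the equivalent is just the name/value pair:

oozie.launcher.yarn.app.mapreduce.am.env = SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/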
Created 05-13-2016 03:03 AM
I got past this! Still no cigar, though. Now I have another error, but I'm going to work on this. It's something different now...
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
java.io.FileNotFoundException: File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
Many thanks for your help. I'd never have been able to figure this out by myself!
Created 10-17-2016 01:12 PM
I'm getting this error also. Have you managed to solve it?
Created 10-17-2016 01:40 PM
Hi aj,
yes, I did manage to solve it. Please take a look at the following thread and see if it can be of help. It may seem a bit unrelated to the "test.py not found" issue, but it contains detailed info about how to specify all the needed parameters to make the whole thing run smoothly:
HTH
Created 10-17-2016 02:18 PM
Ah, my error was not using an hdfs:// path for the .py. Thanks!
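In other words, the action has to reference the .py with a full hdfs:// URI instead of the container-local lib path. Using the workspace path from earlier in this thread purely as an illustration (adjust to your own workspace), the <jar> element would look like:

<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>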