Created on 02-25-2015 09:27 AM - edited 09-16-2022 02:22 AM
I have written a Spark application in Python and tested it successfully; I run it with spark-submit on the command line.
Everything seems to work fine and I get the expected output.
The problem is that when I schedule the application through crontab to run every 5 minutes, it fails with the following error:
/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/bin/compute-classpath.sh: line 64: hadoop: command not found
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:313)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:207)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:59)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 5 more
It looks to me like crontab is not loading the environment variables where I store all the paths to the jars (the hadoop classpath is missing when the script is launched by crontab). Has anyone encountered this issue? I tried some of these solutions: http://unix.stackexchange.com/questions/27289/how-can-i-run-a-cron-command-with-existing-environment...
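For context, a crontab entry that schedules a job every 5 minutes looks roughly like the sketch below; the script path and log location are hypothetical examples, not taken from this thread:

# Runs every 5 minutes; script and log paths are placeholders
*/5 * * * * /home/user/submit_job.sh >> /home/user/spark_cron.log 2>&1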
Created 02-25-2015 10:37 AM
Whatever user you are running this as doesn't seem to have the PATH or env variables set up. See the first error:
hadoop: command not found
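A quick way to confirm this is to capture the environment as cron actually sees it and compare it against an interactive shell; a minimal sketch, with /tmp/cron_env.txt as an arbitrary example path:

# Temporary crontab entry: dump the environment cron runs with
* * * * * { env; which hadoop; } > /tmp/cron_env.txt 2>&1

Running env and which hadoop in an interactive terminal and diffing against /tmp/cron_env.txt usually shows exactly which variables are missing under cron.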
Created 02-26-2015 12:45 AM
Thank you Sowen for the reply, but as I said, the hadoop classpath and the other environment variables are missing only when the script is launched by crontab. I have no problems when I launch the script manually.
Created 02-26-2015 01:23 AM
Right. What user is used in each case?
Created 02-26-2015 01:26 AM
In both cases I use the same user (my own username). I define the crontab schedule with "crontab -e" under that user.
Created 02-26-2015 01:28 AM
Is some of the environment setup happening only in shell config files that are sourced for interactive shells?
The problem is fairly clear: the environment isn't set up. The question is why, but it's not really a Spark issue per se.
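This is worth spelling out: cron runs jobs in a non-interactive, non-login shell with a minimal PATH, so exports that live only in ~/.bashrc (or another interactive-only config) never take effect. A sketch of how to see the difference; HADOOP_HOME here just stands in for whatever variables you export:

# Interactive shell: sources ~/.bashrc, so per-user exports are visible
bash -ic 'echo $HADOOP_HOME; which hadoop'
# Roughly what cron sees: a minimal PATH and no interactive config
env -i PATH=/usr/bin:/bin bash -c 'echo $HADOOP_HOME; which hadoop'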
Created 02-26-2015 11:52 PM
Try adding
source /etc/hadoop/conf/hadoop-env.sh
source /etc/spark/conf/spark-env.sh
to the top of a shell-script that submits your Spark job. Don't have a VM with Spark / Hadoop handy right now, but IIRC that's what I've needed to do in the past.
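Putting that advice together, a minimal wrapper script might look like the sketch below. The application path is a placeholder, and the extra PATH entry is inferred from the parcel path in the error message above, so treat both as assumptions:

#!/bin/bash
# submit_job.sh -- wrapper so cron gets the same environment as an interactive shell

# Pull in the Hadoop and Spark environment, per the advice above
source /etc/hadoop/conf/hadoop-env.sh
source /etc/spark/conf/spark-env.sh

# Put the parcel binaries on PATH so compute-classpath.sh can find "hadoop"
# (directory inferred from the error message; adjust to your parcel layout)
export PATH=$PATH:/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/bin

# Placeholder application path
spark-submit /home/user/my_app.py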
Created 02-27-2015 06:28 AM
Thank you all!
I have re-set the env variables in crontab as you suggested. It seems to work fine!
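For anyone landing here later: cron also lets you define environment variables at the top of the crontab itself, which is one way to "re-set" them as described. A sketch, where the PATH entries are assumptions based on the parcel path in the error above:

# Variables defined at the top of a crontab apply to every job below them
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/bin

*/5 * * * * /home/user/submit_job.sh >> /home/user/spark_cron.log 2>&1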
Created 11-21-2015 05:38 AM
Hi,
I am trying to schedule a Spark job using cron.
I have made a shell script, and it executes well from the terminal.
However, when I execute the script using cron, it gives me an "insufficient memory to start JVM thread" error.
Every time I start the script from the terminal there is no issue; the problem only appears when the script is started by cron.
Could you kindly suggest something?
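No answer follows in the thread, but this is likely the same class of problem as above: cron can run with a different environment and, on some systems, different resource limits than an interactive session, and a low memory ulimit can prevent the JVM from allocating its heap. A diagnostic sketch for comparing the two; the log path is an arbitrary example:

# Temporary crontab entry: capture memory limits and free memory as cron sees them
* * * * * { ulimit -a; free -m; } > /tmp/cron_limits.txt 2>&1
# Run "ulimit -a; free -m" in a terminal and diff against /tmp/cron_limits.txt;
# a lower "max memory size" or "virtual memory" limit under cron is the usual suspect.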