Scheduling Spark with Crontab

Explorer

I have written a Spark application in Python and tested it successfully. I run it with spark-submit from the command line.

Everything seems to work fine and I get the expected output.

The problem is that when I schedule the application through crontab to run every 5 minutes, it fails with the following error:

 

/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/bin/compute-classpath.sh: line 64: hadoop: command not found
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:313)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:207)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:59)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 5 more

 

It looks to me as though crontab does not load the environment variables where I store all the paths to the jars (the Hadoop classpath is missing when the script is launched by crontab). Has anyone encountered this issue? I tried some of these solutions: http://unix.stackexchange.com/questions/27289/how-can-i-run-a-cron-command-with-existing-environment...

 

 

1 ACCEPTED SOLUTION

Contributor

Try adding 

 

source /etc/hadoop/conf/hadoop-env.sh

source /etc/spark/conf/spark-env.sh

 

to the top of the shell script that submits your Spark job. I don't have a VM with Spark/Hadoop handy right now, but IIRC that's what I've needed to do in the past.
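A minimal sketch of such a wrapper script, assuming the two env files named above exist on the machine. The function names `load_env` and `submit_job` and the application path are hypothetical, and whether sourcing these files is enough to put `hadoop` on the PATH depends on the installation.

```shell
#!/bin/sh
# Hypothetical cron wrapper for a Spark job (a sketch, not a tested recipe).

# Source each given file if it is readable; warn instead of failing silently.
load_env() {
    local f
    for f in "$@"; do
        if [ -r "$f" ]; then
            . "$f"
        else
            echo "warning: cannot read $f, skipping" >&2
        fi
    done
}

# Submit only if `hadoop` resolves, since its absence caused the original error.
submit_job() {
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "error: hadoop not found on PATH=$PATH" >&2
        return 1
    fi
    spark-submit "$@"
}

# The two files suggested in the reply above.
load_env /etc/hadoop/conf/hadoop-env.sh /etc/spark/conf/spark-env.sh

# Hypothetical application path; uncomment and adjust for your job:
# submit_job /home/myuser/my_spark_app.py
```

A crontab entry for this could then look like `*/5 * * * * /home/myuser/run_spark_job.sh >> /home/myuser/spark_cron.log 2>&1` (both paths are placeholders). Redirecting the output to a log file keeps any warning or error from the wrapper visible after the run.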

 


8 REPLIES

Master Collaborator

Whatever user you are running this as doesn't seem to have the PATH or env variables set up. See the first error:

 

hadoop: command not found

Explorer

Thank you, Sowen, for the reply, but what I meant was that the Hadoop classpath and the other settings are missing only when the script is launched by crontab. I have no problems when I launch the script manually.

Master Collaborator

Right. What user is used in each case?

Explorer

In both cases I use the same user (my own user name). I define the crontab schedule with "crontab -e" under my user.

Master Collaborator

Is some of the environment setup happening only in shell config that is loaded for interactive shells?

The problem is fairly clear: the environment is not set up. The question is why, but it's not really a Spark issue per se.
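One way to check this is to dump the environment from both contexts and compare the two. The sketch below assumes the output of `env` has been saved once from the terminal and once from a cron job; the function names and file paths are hypothetical.

```shell
#!/bin/sh
# Print the variable names found in an `env` dump, one per line, sorted.
var_names() {
    sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u
}

# List variables set in the interactive dump ($1) but absent from the cron dump ($2).
missing_in_cron() {
    names_a=$(mktemp) || return 1
    names_b=$(mktemp) || return 1
    var_names "$1" > "$names_a"
    var_names "$2" > "$names_b"
    # comm -23 keeps only the lines unique to the first file.
    comm -23 "$names_a" "$names_b"
    rm -f "$names_a" "$names_b"
}

# Example with hypothetical paths:
#   env > /tmp/shell_env.txt                # run once in your terminal
#   * * * * * env > /tmp/cron_env.txt       # temporary crontab entry
#   missing_in_cron /tmp/shell_env.txt /tmp/cron_env.txt
```

This lists only variables that are missing entirely. A variable such as PATH is usually present in both dumps but with a shorter value under cron, so a plain `diff` of the two files is worth a look as well.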

Explorer

Thank you all!

 

I have re-set the env variables in crontab as you suggested. It seems to work fine!

 

 

Explorer

Hi,

 

I am trying to schedule a Spark job using cron.

I have written a shell script, and it runs fine from the terminal.

However, when I run the script from cron, it fails with an error saying there is insufficient memory to start a JVM thread.

Every time I start the script from the terminal there is no issue; the problem occurs only when the script is started by cron.

Could you kindly suggest something?