Created on 02-25-2015 09:27 AM - edited 09-16-2022 02:22 AM
I have written a Spark application in Python and tested it successfully; I run it with spark-submit on the command line.
Everything seems to work fine and I get the expected output.
The problem is that when I schedule the application through crontab to run every 5 minutes, it fails with the following error:
/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/bin/compute-classpath.sh: line 64: hadoop: command not found
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:313)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:207)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:59)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 5 more
It looks to me like crontab is not loading the environment variables where I store all the paths to the jars (the hadoop classpath is missing when the script is launched by crontab). Has anyone encountered this issue? I tried some of these solutions: http://unix.stackexchange.com/questions/27289/how-can-i-run-a-cron-command-with-existing-environment...
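For context, a crontab entry that schedules a job every 5 minutes looks roughly like the sketch below; the script path and log location are hypothetical examples, not taken from this thread:

# Runs every 5 minutes; script and log paths are placeholders
*/5 * * * * /home/user/submit_job.sh >> /home/user/spark_cron.log 2>&1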
Created 02-25-2015 10:37 AM
Whatever user you are running this as doesn't seem to have the PATH or env variables set up. See the first error:
hadoop: command not found
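A quick way to confirm this is to capture the environment as cron actually sees it and compare it against an interactive shell; a minimal sketch, with /tmp/cron_env.txt as an arbitrary example path:

# Temporary crontab entry: dump the environment cron runs with
* * * * * { env; which hadoop; } > /tmp/cron_env.txt 2>&1

Running env and which hadoop in an interactive terminal and diffing against /tmp/cron_env.txt usually shows exactly which variables are missing under cron.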
Created 02-26-2015 12:45 AM
Thank you Sowen for the reply, but as I said, the hadoop classpath and the other environment variables are missing only when the script is launched by crontab. I have no problems when I launch the script manually.
Created 02-26-2015 01:23 AM
Right. What user is used in each case?
Created 02-26-2015 01:26 AM
In both cases I use the same user (my own username). I define the crontab schedule with "crontab -e" under that user.
Created 02-26-2015 01:28 AM
Is some of the environment setup happening only in shell config files that are sourced for interactive shells?
The problem is fairly clear: the environment isn't set up. The question is why, but it's not really a Spark issue per se.
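This is worth spelling out: cron runs jobs in a non-interactive, non-login shell with a minimal PATH, so exports that live only in ~/.bashrc (or another interactive-only config) never take effect. A sketch of how to see the difference; HADOOP_HOME here just stands in for whatever variables you export:

# Interactive shell: sources ~/.bashrc, so per-user exports are visible
bash -ic 'echo $HADOOP_HOME; which hadoop'
# Roughly what cron sees: a minimal PATH and no interactive config
env -i PATH=/usr/bin:/bin bash -c 'echo $HADOOP_HOME; which hadoop'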
Created 02-26-2015 11:52 PM
Try adding
source /etc/hadoop/conf/hadoop-env.sh
source /etc/spark/conf/spark-env.sh
to the top of a shell-script that submits your Spark job. Don't have a VM with Spark / Hadoop handy right now, but IIRC that's what I've needed to do in the past.
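Putting that advice together, a minimal wrapper script might look like the sketch below. The application path is a placeholder, and the extra PATH entry is inferred from the parcel path in the error message above, so treat both as assumptions:

#!/bin/bash
# submit_job.sh -- wrapper so cron gets the same environment as an interactive shell

# Pull in the Hadoop and Spark environment, per the advice above
source /etc/hadoop/conf/hadoop-env.sh
source /etc/spark/conf/spark-env.sh

# Put the parcel binaries on PATH so compute-classpath.sh can find "hadoop"
# (directory inferred from the error message; adjust to your parcel layout)
export PATH=$PATH:/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/bin

# Placeholder application path
spark-submit /home/user/my_app.py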
Created 02-27-2015 06:28 AM
Thank you all!
I have re-set the env variables in crontab as you suggested. It seems to work fine!
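For anyone landing here later: cron also lets you define environment variables at the top of the crontab itself, which is one way to "re-set" them as described. A sketch, where the PATH entries are assumptions based on the parcel path in the error above:

# Variables defined at the top of a crontab apply to every job below them
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/u01/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/bin

*/5 * * * * /home/user/submit_job.sh >> /home/user/spark_cron.log 2>&1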
Created 11-21-2015 05:38 AM
Hi,
I am trying to schedule a Spark job using cron.
I have made a shell script, and it executes well from the terminal.
However, when I execute the script using cron, it gives me an "insufficient memory to start JVM thread" error.
Every time I start the script from the terminal there is no issue; the problem only appears when the script is started by cron.
Could you kindly suggest something?
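No answer follows in the thread, but this is likely the same class of problem as above: cron can run with a different environment and, on some systems, different resource limits than an interactive session, and a low memory ulimit can prevent the JVM from allocating its heap. A diagnostic sketch for comparing the two; the log path is an arbitrary example:

# Temporary crontab entry: capture memory limits and free memory as cron sees them
* * * * * { ulimit -a; free -m; } > /tmp/cron_limits.txt 2>&1
# Run "ulimit -a; free -m" in a terminal and diff against /tmp/cron_limits.txt;
# a lower "max memory size" or "virtual memory" limit under cron is the usual suspect.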