Support Questions


Cannot get pyspark to work (Creating Spark Context) with FileNotFoundError: [Errno 2] No such file or directory: '/usr/hdp/current/spark-client/./bin/spark-submit'

New Contributor

Hi

 

I am using the Cloudera Hortonworks Sandbox Docker image and have followed this tutorial to run Jupyter notebooks: https://community.cloudera.com/t5/Support-Questions/Installing-Jupyter-on-sandbox/td-p/201683

 

This works: the notebook starts with the Python kernel. The error below is raised when attempting to create the Spark context:

 

FileNotFoundError: [Errno 2] No such file or directory: '/usr/hdp/current/spark-client/./bin/spark-submit'

 

 

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-fbb9eeb69493> in <module>
----> 1 spark = SparkSession.builder.master("local").appName("myApp").getOrCreate()

/usr/local/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

/usr/local/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    390         with SparkContext._lock:
    391             if SparkContext._active_spark_context is None:
--> 392                 SparkContext(conf=conf or SparkConf())
    393             return SparkContext._active_spark_context
    394 

/usr/local/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    337         with SparkContext._lock:
    338             if not SparkContext._gateway:
--> 339                 SparkContext._gateway = gateway or launch_gateway(conf)
    340                 SparkContext._jvm = SparkContext._gateway.jvm
    341 

/usr/local/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
     96                     signal.signal(signal.SIGINT, signal.SIG_IGN)
     97                 popen_kwargs['preexec_fn'] = preexec_func
---> 98                 proc = Popen(command, **popen_kwargs)
     99             else:
    100                 # preexec_fn not supported on Windows

/usr/lib64/python3.6/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    727                                 c2pread, c2pwrite,
    728                                 errread, errwrite,
--> 729                                 restore_signals, start_new_session)
    730         except:
    731             # Cleanup if the child failed starting.

/usr/lib64/python3.6/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1362                         if errno_num == errno.ENOENT:
   1363                             err_msg += ': ' + repr(err_filename)
-> 1364                     raise child_exception_type(errno_num, err_msg, err_filename)
   1365                 raise child_exception_type(err_msg)
   1366 

FileNotFoundError: [Errno 2] No such file or directory: '/usr/hdp/current/spark-client/./bin/spark-submit': '/usr/hdp/current/spark-client/./bin/spark-submit'

 

 

I suspect the problem is related to the environment variables, but as a novice I am not sure.

 

Global environment (output of printenv):

 

 

HOSTNAME=sandbox-hdp.hortonworks.com
TERM=xterm
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
SHLVL=1
HOME=/root
container=docker
_=/usr/bin/printenv

 

 

Exports in start_jupyter.sh:

 

 

export SPARK_HOME=/usr/hdp/current/spark-client
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export HADOOP_CONF_DIR=/usr/hdp/current/hadoop-client/conf
export PYTHONPATH="/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip"
export PYTHONSTARTUP=/usr/hdp/current/spark-client/python/pyspark/shell.py
export PYSPARK_SUBMIT_ARGS="--master yarn-client pyspark-shell"
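To confirm whether these exports actually reach the notebook kernel, a cell like the following could be run (just a diagnostic sketch; the path is the one from start_jupyter.sh above):

import os

# If SPARK_HOME is missing here, the exports in start_jupyter.sh did not
# propagate to the process the notebook kernel runs in.
print(os.environ.get('SPARK_HOME'))

# Check whether the binary pyspark is trying to launch actually exists.
print(os.path.exists('/usr/hdp/current/spark-client/bin/spark-submit'))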


Is anyone able to point me in the right direction so that I can create the SparkContext?

 

Many thanks

2 Replies

Master Collaborator

Hi @Boron 

 

Could you please set the SPARK_HOME environment variable as shown below before creating the Spark session?

import os
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'
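For example, in a notebook cell before building the session (a sketch only; the path assumes the default HDP sandbox layout):

import os
from pyspark.sql import SparkSession

# Point pyspark at the Spark client install before the JVM gateway is launched,
# so that it can find bin/spark-submit under SPARK_HOME.
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'

spark = SparkSession.builder.master("local").appName("myApp").getOrCreate()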

References:

  1. https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
  2. https://stackoverflow.com/questions/40087188/cant-find-spark-submit-when-typing-spark-shell

Contributor

Hello @Boron 
I believe you are using HDP 3.x. Note that Spark 1.x is not available in HDP 3; you need to use Spark 2.x. Set SPARK_HOME to the Spark 2 client directory:

export SPARK_HOME=/usr/hdp/current/spark2-client
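In the notebook itself this could be set before creating the session, for example (a sketch assuming the default HDP 3 layout; the py4j zip name under spark2-client/python/lib varies by release, so check the directory):

import os, sys, glob

# Use the Spark 2 client shipped with HDP 3; there is no Spark 1 client to point at.
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark2-client'

# Make the matching pyspark and py4j libraries importable from the notebook.
sys.path.insert(0, '/usr/hdp/current/spark2-client/python')
sys.path.extend(glob.glob('/usr/hdp/current/spark2-client/python/lib/py4j-*.zip'))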