Created 08-31-2022 05:44 AM
Hi
I am using the cloudera hortonworks sandbox docker image, and have followed this tutorial to run Jupyter notebooks: https://community.cloudera.com/t5/Support-Questions/Installing-Jupyter-on-sandbox/td-p/201683
This works: the notebook starts with the Python kernel. The error below is encountered when attempting to create the Spark context:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/hdp/current/spark-client/./bin/spark-submit'
FileNotFoundError Traceback (most recent call last)
<ipython-input-4-fbb9eeb69493> in <module>
----> 1 spark = SparkSession.builder.master("local").appName("myApp").getOrCreate()
/usr/local/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
226 sparkConf.set(key, value)
227 # This SparkContext may be an existing one.
--> 228 sc = SparkContext.getOrCreate(sparkConf)
229 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
230 # by all sessions.
/usr/local/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
390 with SparkContext._lock:
391 if SparkContext._active_spark_context is None:
--> 392 SparkContext(conf=conf or SparkConf())
393 return SparkContext._active_spark_context
394
/usr/local/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
142 " is not allowed as it is a security risk.")
143
--> 144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/local/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
337 with SparkContext._lock:
338 if not SparkContext._gateway:
--> 339 SparkContext._gateway = gateway or launch_gateway(conf)
340 SparkContext._jvm = SparkContext._gateway.jvm
341
/usr/local/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
96 signal.signal(signal.SIGINT, signal.SIG_IGN)
97 popen_kwargs['preexec_fn'] = preexec_func
---> 98 proc = Popen(command, **popen_kwargs)
99 else:
100 # preexec_fn not supported on Windows
/usr/lib64/python3.6/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
727 c2pread, c2pwrite,
728 errread, errwrite,
--> 729 restore_signals, start_new_session)
730 except:
731 # Cleanup if the child failed starting.
/usr/lib64/python3.6/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1362 if errno_num == errno.ENOENT:
1363 err_msg += ': ' + repr(err_filename)
-> 1364 raise child_exception_type(errno_num, err_msg, err_filename)
1365 raise child_exception_type(err_msg)
1366
FileNotFoundError: [Errno 2] No such file or directory: '/usr/hdp/current/spark-client/./bin/spark-submit': '/usr/hdp/current/spark-client/./bin/spark-submit'
I think the problem might be connected to the environment variables, but as a novice I am not sure where to look.
Global environment (output of printenv):
HOSTNAME=sandbox-hdp.hortonworks.com
TERM=xterm
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
SHLVL=1
HOME=/root
container=docker
_=/usr/bin/printenv
Exported in start_jupyter.sh:
export SPARK_HOME=/usr/hdp/current/spark-client
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export HADOOP_CONF_DIR=/usr/hdp/current/hadoop-client/conf
export PYTHONPATH="/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip"
export PYTHONSTARTUP=/usr/hdp/current/spark-client/python/pyspark/shell.py
export PYSPARK_SUBMIT_ARGS="--master yarn-client pyspark-shell"
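In case it is relevant, I can run a quick check like this in a notebook cell to see whether spark-submit actually exists under either client directory (the spark2-client path is a guess on my part; it may not exist on my sandbox):

import os
# check both possible Spark client locations; spark2-client is a guess, not confirmed
for home in ['/usr/hdp/current/spark-client', '/usr/hdp/current/spark2-client']:
    submit = os.path.join(home, 'bin', 'spark-submit')
    print(submit, '->', 'exists' if os.path.exists(submit) else 'missing')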
Is anyone able to point me in the right direction so that I can create the SparkContext?
Many thanks
Created 09-21-2022 10:22 PM
Hi @Boron
Could you please set the SPARK_HOME environment variable as shown below before creating the Spark session?
import os
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'
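For example, a minimal notebook cell could look like the following (a sketch using the Spark 1.x client path from your start_jupyter.sh; swap in the spark2-client path if you are on HDP 3):

import os
# SPARK_HOME must be set before PySpark launches the spark-submit JVM gateway
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("myApp").getOrCreate()

The key point is that SPARK_HOME has to be set before getOrCreate() is called, since that is when PySpark resolves the path to spark-submit.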
Created 09-21-2022 11:38 PM
Hello @Boron
I believe you are using HDP 3.x. Note that Spark 1.x is not available in HDP 3; you need to use Spark 2.x, so set SPARK_HOME to the Spark 2 client:
export SPARK_HOME=/usr/hdp/current/spark2-client
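As a quick sanity check (a sketch, assuming the standard HDP 3 sandbox layout), you can confirm from inside the notebook that the Spark 2 client is actually present before creating the session:

import os
spark2_home = '/usr/hdp/current/spark2-client'  # standard HDP 3 location
# should print True if the Spark 2 client is installed on the sandbox
print(os.path.exists(os.path.join(spark2_home, 'bin', 'spark-submit')))
os.environ['SPARK_HOME'] = spark2_home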