I'm using CDH 5.14.
In CM, under the host configuration, I have set Parcel Directory = /opt/cloudera/parcels.
On the system, /opt/cloudera is a symlink: /opt/cloudera -> /data/cloudera
In the Spark config files, spark-env.sh and spark-defaults, every occurrence of /opt/cloudera is replaced with /data/cloudera.
Is this normal behaviour? Can't the parcel directory be configured as a symlink path?
Interesting. It appears the Spark2 CSD has the common.sh do this on purpose:
# Make sure PARCELS_ROOT is in the format we expect, canonicalized and without a trailing slash.
export PARCELS_ROOT=$(readlink -m "$PARCELS_ROOT")
So this is expected given your description.
"readlink -m" will follow any number of links and return the actual directory/file.
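You can reproduce the substitution outside of CM with the same command common.sh uses. A minimal sketch (the /tmp paths are made up for the demo):

```shell
# Build a layout like yours: a real parcel root plus a symlink to it.
mkdir -p /tmp/gr_demo/data/cloudera/parcels
ln -sfn /tmp/gr_demo/data/cloudera /tmp/gr_demo/opt-cloudera

# common.sh runs: export PARCELS_ROOT=$(readlink -m "$PARCELS_ROOT")
# readlink -m follows every symlink component, so the configured
# symlink path comes back as the real path.
readlink -m "/tmp/gr_demo/opt-cloudera/parcels"
# prints /tmp/gr_demo/data/cloudera/parcels
```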
No, common.sh does not belong in /etc/spark2/conf
common.sh is located inside your CSD jar:
jar tvf SPARK2_ON_YARN-2.3.0.cloudera5-SNAPSHOT.jar
    0 Tue Jul 23 23:16:06 PDT 2019 META-INF/
   69 Tue Jul 23 23:16:06 PDT 2019 META-INF/MANIFEST.MF
    0 Tue Jul 23 23:16:06 PDT 2019 descriptor/
25781 Tue Jul 23 23:16:06 PDT 2019 descriptor/service.sdl
    0 Wed Jul 17 20:46:54 PDT 2019 aux/
    0 Wed Jul 17 20:46:54 PDT 2019 aux/client/
 2224 Wed Jul 17 20:46:54 PDT 2019 aux/client/spark-env.sh
    0 Wed Jul 17 20:46:54 PDT 2019 images/
 3312 Wed Jul 17 20:46:54 PDT 2019 images/icon.png
    0 Wed Jul 17 20:46:54 PDT 2019 scripts/
19696 Wed Jul 17 20:46:54 PDT 2019 scripts/common.sh
 1884 Wed Jul 17 20:46:54 PDT 2019 scripts/control.sh
    0 Wed Jul 17 23:17:24 PDT 2019 meta/
   24 Wed Jul 17 23:17:24 PDT 2019 meta/version
I'm curious why you are asking this question. Are you seeing a problem and trying to solve it?
The problem is on one host where the Spark2 gateway runs: there, parcels are installed under /data/cloudera/parcels. On the rest of the cluster, worker nodes included, parcels are in /opt/cloudera/parcels. So on that one host we've symlinked /opt/cloudera -> /data/cloudera.
When we run Spark code in YARN mode we get the following errors:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, lxspkop010.at.inside, executor 1): org.apache.spark.SparkException:
Error from python worker:
  /bin/python2: No module named pyspark
PYTHONPATH was:
  /opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/jars/spark-core_2.11-2.2.0.cloudera4.jar:/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/py4j-0.10.7-src.zip:/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/::/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/py4j-0.10.7-src.zip:/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/pyspark.zip:/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/py4j-0.10.7-src.zip:/data/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/pyspark.zip
java.io.EOFException
On the other gateways all is fine. The problem appears only with the RDD API. The parcel directory is set to /opt/cloudera/parcels in that host's configuration.
I'm not sure of an elegant solution here, but you could try adding the "real" (symlink-resolved) versions of any paths that traverse the link to the following setting in the Spark2 config in CM:
Extra Python Path
For the error, maybe iterate over the PYTHONPATH entries and make sure they exist.
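That check can be scripted on the affected host; a minimal sketch that splits PYTHONPATH on ':' and flags entries that don't exist on disk:

```shell
# Flag PYTHONPATH entries that do not exist on this host.
# (Empty entries, e.g. from the '::' in your error, are skipped.)
echo "$PYTHONPATH" | tr ':' '\n' | while read -r p; do
  if [ -n "$p" ] && [ ! -e "$p" ]; then
    echo "missing: $p"
  fi
done
```

Any entry reported as missing — for instance one that still points at /opt/cloudera/parcels on the host where only /data/cloudera/parcels exists without the symlink in place for the executing user — would explain "No module named pyspark".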
Apart from that, perhaps run "strace" on your client to find out what it is looking for and where it is failing to find it.
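If it comes to that, a sketch of the strace invocation (the job script name and output file are placeholders; spark2-submit is the CDH Spark2 client command):

```shell
# Trace file lookups of the failing client; -f follows forked
# python workers, -e limits output to open/stat-style syscalls.
strace -f -e trace=openat,stat -o /tmp/pyspark.strace \
  spark2-submit your_job.py    # your_job.py is a placeholder

# Then look for failed lookups mentioning pyspark:
grep ENOENT /tmp/pyspark.strace | grep pyspark
```

A run of ENOENT results against one path prefix but not the other would show which of /opt/cloudera vs. /data/cloudera the worker is actually probing.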
I see in your error that the pyspark module cannot be found, but from that error information alone it is not clear why (at least to me).