Created on 04-04-2019 11:48 PM
I am capturing the steps to install a supplementary Spark version on top of your HDP version. Installing a version that is not shipped by Ambari is unsupported and not recommended; however, customers sometimes need it for testing purposes.
Here are the steps:
1. Create the spark user on all nodes. Add it to the hdfs group.
useradd -G hdfs spark
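A quick sanity check on each node (just a verification, not part of the original steps):
id spark    [the group list should include hdfs]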
2. Create the conf, log, run, and lib directories
mkdir -p /etc/spark2.4/conf
mkdir -p /var/log/spark2.4
chmod 755 /var/log/spark2.4/
chown spark:hadoop /var/log/spark2.4
mkdir -p /var/run/spark2.4
chown spark:hadoop /var/run/spark2.4
chmod 775 /var/run/spark2.4
mkdir /var/lib/spark2/
chown spark:spark /var/lib/spark2/
3. Create a directory under /usr/hdp and cd into it
mkdir -p /usr/hdp/<your hdp version>/spark2.4    (as the root user)
cd /usr/hdp/<your hdp version>/spark2.4
4. Download the Spark 2.4.0 tarball from http://apache.claz.org/spark/spark-2.4.0/
wget http://apache.claz.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
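If that mirror is unavailable, the Apache archive keeps old releases (assuming the 2.4.0 directory is still published at https://archive.apache.org/dist/spark/spark-2.4.0/), and it is worth comparing the checksum with the .sha512 file published alongside the tarball:
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
sha512sum spark-2.4.0-bin-hadoop2.7.tgz    [compare against the published .sha512 file]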
5. Extract the Spark tar file
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7/* .
rm -rf spark-2.4.0-bin-hadoop2.7*    [Clean up the directory]
6. Change ownership to root
chown root:root /usr/hdp/3.1.0.0-78/spark2.4 (root user)
7. Modify the configuration files
cd /usr/hdp/3.1.0.0-78/spark2.4/conf
cp log4j.properties.template log4j.properties
7.1 Create spark-defaults.conf and add the lines below to it
cp spark-defaults.conf.template spark-defaults.conf
==========
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraJavaOptions -XX:+UseNUMA
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 7d
spark.history.fs.cleaner.maxAge 90d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.store.path /var/lib/spark2/shs_db
spark.history.ui.port 18081
spark.io.compression.lz4.blockSize 128kb
spark.master yarn
spark.shuffle.file.buffer 1m
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 26214400
spark.sql.hive.convertMetastoreOrc true
spark.sql.hive.metastore.jars /usr/hdp/current/spark2-client/standalone-metastore/*
spark.sql.hive.metastore.version 3.0
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.sql.warehouse.dir /apps/spark2.4/warehouse
spark.unsafe.sorter.spill.reader.buffer.size 1m
spark.yarn.historyServer.address <hostname of localhost>:18081    [Make sure port 18081 is not used by any other process]
spark.yarn.queue default
==========
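Before settling on 18081 for spark.history.ui.port and spark.yarn.historyServer.address, you can quickly confirm nothing else is listening on that port (any equivalent netstat/ss check works):
ss -ltn | grep 18081    [no output means the port is free]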
7.2 Copy spark-env.sh.template to spark-env.sh and edit it as below
==========
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2"   # Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1"       # Number of cores for the workers (Default: 1)
#SPARK_EXECUTOR_MEMORY="1G"     # Memory per worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M"     # Memory for the driver (e.g. 1000M, 2G) (Default: 512M)
#SPARK_YARN_APP_NAME="spark"    # The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default"     # The Hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES=""       # Comma-separated list of files to be distributed with the job
#SPARK_YARN_DIST_ARCHIVES=""    # Comma-separated list of archives to be distributed with the job

# Generic options for the daemons used in the standalone deploy mode

# Alternate conf dir. (Default: ${SPARK_HOME}/conf)
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-/usr/hdp/<your-hadoop-version>/spark2.4/conf}

# Where log files are stored. (Default: ${SPARK_HOME}/logs)
export SPARK_LOG_DIR=/var/log/spark2.4

# Where the pid file is stored. (Default: /tmp)
export SPARK_PID_DIR=/var/run/spark2.4

# Memory for master, worker and history server (Default: 1024MB)
export SPARK_DAEMON_MEMORY=2048m

# A string representing this instance of Spark. (Default: $USER)
SPARK_IDENT_STRING=$USER

# The scheduling priority for daemons. (Default: 0)
SPARK_NICENESS=0

export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/<your-hadoop-version>/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/<your-hadoop-version>/hadoop/conf}

# The java implementation to use.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112    [Replace it with your Java installation path]
==========
8. Change the ownership of all the config files to spark:spark
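For example, assuming the /usr/hdp/3.1.0.0-78/spark2.4 location used in the steps above:
chown -R spark:spark /usr/hdp/3.1.0.0-78/spark2.4/conf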
9. Symlinks - Create the symlinks below
ln -s /usr/hdp/3.1.0.0-78/spark2.4/conf/ /etc/spark2.4
ln -s /etc/hive/conf/hive-site.xml /usr/hdp/3.1.0.0-78/spark2.4/conf/hive-site.xml    [make sure the Hive client is installed on this node]
10. Create the HDFS directories
hadoop fs -mkdir /spark2-history
hadoop fs -chown spark:hadoop /spark2-history
hadoop fs -chmod -R 777 /spark2-history
hadoop fs -mkdir -p /apps/spark2.4/warehouse
hadoop fs -chown spark:spark /apps/spark2.4/warehouse
hadoop fs -mkdir /user/spark
hadoop fs -chown spark:spark /user/spark
hadoop fs -chmod -R 755 /user/spark
Copy the Hadoop Jersey client jars into the Spark jars directory
cp /usr/hdp/3.1.0.0-78/hadoop/lib/jersey-* /usr/hdp/3.1.0.0-78/spark2.4/jars/
Start the Spark History Server
cd /usr/hdp/3.1.0.0-78/spark2.4/sbin/
./start-history-server.sh    [run as the spark user]
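To confirm it came up, a couple of quick checks (the hostname placeholder is the same one used in spark-defaults.conf):
jps | grep HistoryServer    [the HistoryServer JVM should be listed]
curl -s http://<hostname of localhost>:18081 | head -5    [should return the History Server UI HTML]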
11. Run a sample Spark job
export SPARK_HOME=/usr/hdp/3.1.0.0-78/spark2.4/
$SPARK_HOME/bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
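To verify the job ran, you can look it up in YARN and pull the aggregated logs; in cluster mode the SparkPi result line appears in the driver (ApplicationMaster) container log. The application id comes from the spark-submit output:
yarn application -list -appStates FINISHED | grep SparkPi
yarn logs -applicationId <application id> | grep "Pi is roughly"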
Created on 06-03-2019 05:18 PM
Good guide! But needs some minor amendments:
After doing all this, I still ended up getting the "bad substitution" error I mentioned in https://community.hortonworks.com/questions/247162/run-spark-24-jobs-on-hdp-31.html
I'll keep trying...
Created on 01-23-2020 09:35 AM
What if the HDP cluster is kerberized?
How do you set up a supplementary Spark in that case?
Created on 01-28-2020 03:21 AM
If you have HDP 3+ and want to use the Hive metastore, you will have problems with versioning between Hive and Spark. Right now the Hive metastore versions available to Spark are 0.12.0 through 2.3.3. You can check for updates at this URL:
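For reference, a rough sketch (not from the article) of pinning Spark 2.4 to a metastore client version it supports via spark-defaults.conf; "maven" makes Spark download matching Hive client jars itself, or you can point spark.sql.hive.metastore.jars at a local classpath instead:
==========
spark.sql.hive.metastore.version 2.3.3
spark.sql.hive.metastore.jars maven
==========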
Created on 07-27-2020 12:19 AM
Does this work for Spark 3.0?