
These are the steps to install a supplementary Spark version alongside the one shipped with your HDP release. Installing a version not shipped by Ambari is unsupported and not recommended; however, customers sometimes need it for testing purposes.

Here are the steps:

1. Create the spark user on all nodes. Add it to the hdfs group.

    useradd -G hdfs spark 

2. Create the conf, log, run, and lib directories

mkdir -p /etc/spark2.4/conf
mkdir -p /var/log/spark2.4
chmod 755 /var/log/spark2.4/
chown spark:hadoop /var/log/spark2.4
mkdir -p /var/run/spark2.4
chown spark:hadoop /var/run/spark2.4
chmod 775 /var/run/spark2.4
mkdir /var/lib/spark2/
chown spark:spark /var/lib/spark2/


3. Create a directory in /usr/hdp and cd to the dir

mkdir -p /usr/hdp/<your hdp version>/spark2.4   # run as root
cd /usr/hdp/<your hdp version>/spark2.4
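
If you are not sure what to put in place of the version placeholder, the directory names under /usr/hdp are one place to look. The following is only a sketch under that assumption: `list_hdp_versions` and `spark24_home` are hypothetical helper names, and the pattern assumes version directories look like 3.1.0.0-78.

```shell
# Sketch: list installed HDP versions by reading the directory names under
# /usr/hdp (assumed to look like 3.1.0.0-78). Entries such as "current"
# do not match the pattern and are skipped.
list_hdp_versions() {
    root=${1:-/usr/hdp}
    ls "$root" | grep -E '^[0-9]+(\.[0-9]+)+-[0-9]+$'
}

# Build the install path used in this guide for one version string.
spark24_home() {
    echo "/usr/hdp/$1/spark2.4"
}
```

For example, `spark24_home 3.1.0.0-78` prints `/usr/hdp/3.1.0.0-78/spark2.4`, which is the path used in the later steps.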


4. Download the tar file from location http://apache.claz.org/spark/spark-2.4.0/

 wget http://apache.claz.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

5. Extract spark tar file

tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7/* .
rm -rf spark-2.4.0-bin-hadoop2.7*   # clean up the directory

6. Change ownership to root (the examples use HDP 3.1.0.0-78; substitute your own version)

chown root:root /usr/hdp/3.1.0.0-78/spark2.4   # run as root

7. Modify the configuration files

cd /usr/hdp/3.1.0.0-78/spark2.4/conf
cp log4j.properties.template log4j.properties

7.1 Create spark-defaults.conf and add the lines below to it

cp  spark-defaults.conf.template spark-defaults.conf

==========

spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraJavaOptions -XX:+UseNUMA
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 7d
spark.history.fs.cleaner.maxAge 90d
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.store.path /var/lib/spark2/shs_db
spark.history.ui.port 18081
spark.io.compression.lz4.blockSize 128kb
spark.master yarn
spark.shuffle.file.buffer 1m
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 26214400
spark.sql.hive.convertMetastoreOrc true
spark.sql.hive.metastore.jars /usr/hdp/current/spark2-client/standalone-metastore/*
spark.sql.hive.metastore.version 3.0
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.sql.warehouse.dir /apps/spark2.4/warehouse
spark.unsafe.sorter.spill.reader.buffer.size 1m
# Make sure port 18081 is not used by any other process
spark.yarn.historyServer.address <hostname of localhost>:18081
spark.yarn.queue default

==========
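
The note on spark.yarn.historyServer.address asks that port 18081 be free. One way to check, sketched here on the assumption that `ss` is available on the node, is to filter the output of `ss -ltn` for a listener on that port. `port_listed` is a hypothetical helper name, not something shipped with Spark or HDP.

```shell
# Sketch: read `ss -ltn` output on stdin and succeed (exit 0) if some
# socket in LISTEN state has a local address ending in :<port>.
port_listed() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" && $4 ~ (suffix "$") { found = 1 }
        END { exit !found }
    '
}

# On the history server host:
# ss -ltn | port_listed 18081 && echo "port 18081 is already in use"
```

The helper only parses text, so it can be tried against saved `ss` output before being run on the node itself.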

7.2 Edit the spark-env.sh file

==========

#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read in YARN client mode
#SPARK_EXECUTOR_INSTANCES="2" #Number of workers to start (Default: 2)
#SPARK_EXECUTOR_CORES="1" #Number of cores for the workers (Default: 1).
#SPARK_EXECUTOR_MEMORY="1G" #Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
#SPARK_DRIVER_MEMORY="512M" #Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
#SPARK_YARN_APP_NAME="spark" #The name of your application (Default: Spark)
#SPARK_YARN_QUEUE="default" #The hadoop queue to use for allocation requests (Default: default)
#SPARK_YARN_DIST_FILES="" #Comma separated list of files to be distributed with the job.
#SPARK_YARN_DIST_ARCHIVES="" #Comma separated list of archives to be distributed with the job.

# Generic options for the daemons used in the standalone deploy mode

# Alternate conf dir. (Default: ${SPARK_HOME}/conf)
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-/usr/hdp/<your-hadoop-version>/spark2.4/conf}

# Where log files are stored.(Default:${SPARK_HOME}/logs)
export SPARK_LOG_DIR=/var/log/spark2.4

# Where the pid file is stored. (Default: /tmp)
export SPARK_PID_DIR=/var/run/spark2.4

#Memory for Master, Worker and history server (default: 1024MB)
export SPARK_DAEMON_MEMORY=2048m

# A string representing this instance of spark.(Default: $USER)
SPARK_IDENT_STRING=$USER

# The scheduling priority for daemons. (Default: 0)
SPARK_NICENESS=0

export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/<your-hadoop-version>/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/<your-hadoop-version>/hadoop/conf}

# The java implementation to use. Replace the path with your Java version.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112


============


8. Change the ownership of all the config files to spark:spark
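
Step 8 gives no command. A minimal sketch, assuming the config files are the ones in the conf directory from step 7 and that a recursive chown is what is meant. `own_conf` is a hypothetical helper name; the owner and path are parameters so the same call works for any HDP version.

```shell
# Sketch: give every file under a conf directory to one owner.
# Fails with a message if the directory does not exist.
own_conf() {
    owner=$1
    conf_dir=$2
    [ -d "$conf_dir" ] || { echo "no such directory: $conf_dir" >&2; return 1; }
    chown -R "$owner" "$conf_dir"
}

# As root, with the 3.1.0.0-78 layout used in step 7 (an assumption):
# own_conf spark:spark /usr/hdp/3.1.0.0-78/spark2.4/conf
```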

9. Create the symlinks below (make sure the Hive client is installed on this node)

ln -s /usr/hdp/2.6.5.0-292/spark2/conf/ /etc/spark2
ln -s /etc/hive/conf/hive-site.xml /usr/hdp/2.6.5.0-292/spark2/conf/hive-site.xml


10. Create HDFS directory

hadoop fs -mkdir /spark2-history
hadoop fs -chown spark:hadoop /spark2-history
hadoop fs -chmod -R 777 /spark2-history
hadoop fs -mkdir /apps/spark2.4/warehouse
hadoop fs -chown spark:spark /apps/spark2.4/warehouse

hadoop fs -mkdir /user/spark
hadoop fs -chown spark:spark /user/spark
hadoop fs -chmod -R 755 /user/spark
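
The mkdir, chown, chmod pattern above repeats for each HDFS directory. As a sketch only, it can be wrapped in one function. `hdfs_dir` and the `HADOOP_CMD` override are hypothetical names introduced here; `-p` is added so that the call creates missing parent directories and does not fail when the directory already exists.

```shell
# Sketch: create an HDFS directory, then set its owner and mode.
# HADOOP_CMD defaults to "hadoop"; setting it to "echo" prints the
# commands instead of running them, which is useful for a preview.
hdfs_dir() {
    hdfs_path=$1
    owner=$2
    mode=$3
    ${HADOOP_CMD:-hadoop} fs -mkdir -p "$hdfs_path" &&
    ${HADOOP_CMD:-hadoop} fs -chown "$owner" "$hdfs_path" &&
    ${HADOOP_CMD:-hadoop} fs -chmod "$mode" "$hdfs_path"
}

# Preview the commands for /user/spark without touching the cluster:
# HADOOP_CMD=echo hdfs_dir /user/spark spark:spark 755
```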

Copy hadoop jars

cp /usr/hdp/3.1.0.0-78/hadoop/lib/jersey-* /usr/hdp/3.1.0.0-78/spark2.4/jars/

Start Spark history service

cd /usr/hdp/3.1.0.0-78/spark2.4/sbin/
./start-history-server.sh


11. Run a sample spark job

export SPARK_HOME=/usr/hdp/3.1.0.0-78/spark2.4
$SPARK_HOME/bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
Comments
Explorer

Good guide! But needs some minor amendments:

  • step 9: I think this is quite wrong, and will (try to) create links in the main HDP Spark2 installation (2.3). I did "ln -s /etc/hive/conf/hive-site.xml /usr/hdp/3.1.0.0-78/spark2.4/conf/hive-site.xml"
  • step 10, "Create HDFS directory": /spark2-history will already exist, and the "hadoop fs -mkdir /apps/spark2.4/warehouse" step needs a -p flag
  • step 10, "Copy hadoop JARs": should be "cp /usr/hdp/3.1.0.0-78/hadoop/client/jersey-* /usr/hdp/3.1.0.0-78/spark2.4/jars/" (the JARs in the client dir, not lib/)

After doing all this, I still ended up getting the "bad substitution" error I mentioned in https://community.hortonworks.com/questions/247162/run-spark-24-jobs-on-hdp-31.html

I'll keep trying...

Explorer

What if the HDP cluster is Kerberized?
How do you set up the supplementary Spark in that case?

Rising Star

If you have HDP 3+ and want to use the Hive metastore, you will run into versioning problems between Hive and Spark. Right now the Hive metastore versions that Spark supports are 0.12.0 through 2.3.3. You can check for updates at this URL:

https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-ve...

 

Explorer

Does this work for Spark 3.0?