Requirements

An HDP 2.3.x cluster, either a multi-node cluster or a single-node HDP Sandbox.

Installing

The Spark 1.6 Technical Preview is provided in RPM and DEB package formats. The following instructions assume RPM packaging:

  1. Download the Spark 1.6 RPM repository:
    wget -nv http://private-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.1-10/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo
    
    For installation on Ubuntu, use the following repository list instead:
    http://private-repo-1.hortonworks.com/HDP/ubuntu12/2.x/updates/2.3.4.1-10/hdp.list
  2. Install the Spark Package: Download the Spark 1.6 RPM (and pySpark, if desired) and set it up on your HDP 2.3 cluster:
    yum install spark_2_3_4_1_10-master -y

    If you want to use pySpark, install it as follows and make sure that Python is installed on all nodes.

    yum install spark_2_3_4_1_10-python -y

    The RPM installer also downloads core Hadoop dependencies, creates a “spark” OS user, and creates the /user/spark directory in HDFS.

  3. Set JAVA_HOME and SPARK_HOME: Make sure that you set JAVA_HOME before you launch the Spark Shell or thrift server.
    export JAVA_HOME=<path to JDK 1.8>

    The Spark install creates the directory where Spark binaries are unpacked (/usr/hdp/2.3.4.1-10/spark). Set the SPARK_HOME variable to this directory:

    export SPARK_HOME=/usr/hdp/2.3.4.1-10/spark/
  4. Create hive-site.xml in the Spark conf directory: As user root, create the file $SPARK_HOME/conf/hive-site.xml. Edit the file so that it contains only the following configuration setting (a quick pySpark check of this setting is sketched just after this list):
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
        <value>thrift://sandbox.hortonworks.com:9083</value>
        <description>URI for client to contact metastore server</description>
      </property>
    </configuration>
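
Once the four installation steps are complete, the hive.metastore.uris setting can be smoke-tested from pySpark. The sketch below assumes the pySpark package from step 2 is installed and that a Hive Metastore is running at the configured URI; the script name metastore_check.py is only illustrative.

    # metastore_check.py -- minimal metastore connectivity check; submit with:
    #   $SPARK_HOME/bin/spark-submit --master yarn-client metastore_check.py
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="MetastoreSmokeTest")
    sqlContext = HiveContext(sc)  # reads conf/hive-site.xml, including hive.metastore.uris

    # Listing databases goes through the configured metastore URI; if the
    # connection works, at least the "default" database is printed.
    for row in sqlContext.sql("SHOW DATABASES").collect():
        print(row)

    sc.stop()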

Run the Spark Pi Example

To test compute-intensive tasks in Spark, the Pi example estimates pi by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall inside the quarter of the unit circle that lies within the square. The fraction of points inside approximates pi/4, so multiplying it by 4 yields the estimate of pi.
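
Before running the packaged Scala example below, it may help to see the same dart-throwing logic as a short pySpark sketch (this assumes the pySpark package from step 2 is installed; the script name pi_estimate.py is only illustrative, not part of the shipped examples):

    # pi_estimate.py -- minimal sketch of the Monte Carlo estimate described above
    from random import random
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="PythonPi")

    partitions = 10          # number of tasks; the Scala example below is also passed 10
    n = 100000 * partitions  # total number of darts thrown

    def inside(_):
        # Throw one dart at the unit square; count it if it lands inside the quarter circle.
        x, y = random(), random()
        return 1 if x * x + y * y < 1 else 0

    count = sc.parallelize(range(n), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()

The packaged Scala example that ships with the RPM is run as follows: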

  1. Change to your Spark directory and switch to the spark OS user:
    cd $SPARK_HOME
    su spark
  2. Run the Spark Pi example in yarn-client mode:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

    Note: The Pi job should complete without any failure messages and produce output similar to the following; the value of pi appears near the end.

    15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s
    Pi is roughly 3.139492
    15/12/16 13:21:05 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
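
The estimate improves only slowly with more darts: with the roughly 1,000,000 samples used here, the result typically differs from pi in the third decimal place. A quick local Python check (no Spark required) illustrates the spread; the script name error_check.py is only illustrative.

    # error_check.py -- local illustration of the Monte Carlo error (no Spark needed)
    from random import random

    n = 1000000  # on the order of the darts thrown by the example run above
    hits = sum(1 for _ in range(n) if random() ** 2 + random() ** 2 < 1)
    estimate = 4.0 * hits / n
    print("Estimate: %f, error: %f" % (estimate, abs(estimate - 3.141592653589793)))
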
Comments
New Contributor

Hi, I couldn't see Spark pre-installed on HDP 2.4. If it's not included, how do I enable it?

Contributor

Where do I run these commands?