Does it work in Chrome or Firefox?
Hi Anindya, generally all the tutorials and corresponding labs are for the latest sandbox, currently HDP 2.5. Two options:
1. Download the latest sandbox HDP 2.5
2. Use a similar notebook in the main Zeppelin notebook list: Lab 201: Intro to Machine Learning with Spark
You can find other Zeppelin notebooks here:
And in the future each version of Sandbox will have a corresponding branch. E.g. there's an HDP 2.5 branch now, so there won't be compatibility issues down the road with newer versions of Zeppelin on older Sandboxes.
Make sure you are running the latest HDP 2.5 Sandbox. I've just tested it and I had no "prefix not found" related issues.
Updated tutorial:
1) using centos-release-scl
2) wget
Check out the latest blog on the HBase connector:
Sridhar, as long as you're using Spark 1.6 I'd refer to
Hi Sridhar, can you post what version of Spark you are running and a link to the documentation you're referring to?
Requirements
HDP 2.3.x cluster, either a multi-node cluster or a single-node HDP Sandbox.
Installing
The Spark 1.6 Technical Preview is provided in RPM and DEB package formats. The following instructions assume RPM packaging:
Download the Spark 1.6 RPM repository: wget -nv -O /etc/yum.repos.d/HDP-TP.repo
For installing on Ubuntu use the following:
Install the Spark Package:
Download the Spark 1.6 RPM (and pySpark, if desired) and set it up on your HDP 2.3 cluster:
yum install spark_2_3_4_1_10-master -y
If you want to use pySpark, install it as follows and make sure that Python is installed on all nodes:
yum install spark_2_3_4_1_10-python -y
The RPM installer will also download core Hadoop dependencies. It will create “spark” as an OS user, and it will create the /user/spark directory in HDFS.
Make sure that you set JAVA_HOME before you launch the Spark Shell or Thrift server:
export JAVA_HOME=<path to JDK 1.8>
The Spark install creates the directory where Spark binaries are unpacked (/usr/hdp/ Set the SPARK_HOME variable to this directory:
export SPARK_HOME=/usr/hdp/
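Before moving on, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch, not part of the official instructions: `check_spark_env` is a hypothetical helper name, and it only checks that each variable is non-empty and points to an existing directory.

```shell
# Hypothetical helper: verify JAVA_HOME and SPARK_HOME before launching Spark.
check_spark_env() {
  status=0
  for name in JAVA_HOME SPARK_HOME; do
    # Read the variable whose name is stored in $name (empty if unset)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value" >&2
      status=1
    fi
  done
  return "$status"
}

if check_spark_env; then
  echo "Spark environment looks ready"
else
  echo "Fix the variables listed above, then retry"
fi
```

If the check fails, re-run the export commands in the same shell session that will launch Spark.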
Create hive-site in the Spark conf directory:
As user root, create the file $SPARK_HOME/conf/hive-site.xml. Edit the file to contain only the following configuration setting:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
Run the Spark Pi Example
To test compute-intensive tasks in Spark, the Pi example calculates pi by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall within the unit circle. The fraction of points inside approximates pi/4, so multiplying it by 4 gives an estimate of pi.
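The dart-throwing estimate described above can be tried locally without a cluster. This is a minimal single-process sketch using awk, not the code of the bundled SparkPi example; the function name `estimate_pi`, the sample count, and the fixed seed are all assumptions made for illustration.

```shell
# Monte Carlo estimate of pi: sample points in the unit square and count
# those that satisfy x^2 + y^2 <= 1 (inside the circle of radius 1).
estimate_pi() {
  samples="${1:-100000}"   # hypothetical parameter: number of darts to throw
  awk -v n="$samples" 'BEGIN {
    srand(42)                          # fixed seed so repeated runs agree
    inside = 0
    for (i = 0; i < n; i++) {
      x = rand(); y = rand()           # random point in the unit square
      if (x * x + y * y <= 1) inside++
    }
    printf "%.4f\n", 4 * inside / n    # inside/n approximates pi/4
  }'
}

estimate_pi 100000
```

Spark runs the same kind of sampling in parallel across executors and then sums the per-partition counts, which is why the job log ends with a reduce step.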
Change to your Spark directory and switch to the spark OS user: cd $SPARK_HOME
su spark
Run the Spark Pi example in yarn-client mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Note: The Pi job should complete without any failure messages. It should produce output similar to the following. Note the value of pi near the end of the output.
15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s
Pi is roughly 3.139492
15/12/16 13:21:05 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
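Because the result line is buried among INFO messages, it can be handy to filter it out. The sketch below assumes the job prints a line of the form "Pi is roughly <number>", as in the sample output above; `extract_pi` is a hypothetical helper name, and the two input lines are copied from that sample rather than from a live run.

```shell
# Hypothetical helper: print only the numeric estimate from Spark Pi output.
extract_pi() {
  # -n suppresses default output; the trailing p prints only matching lines
  sed -n 's/.*Pi is roughly \([0-9][0-9.]*\).*/\1/p'
}

# Feed in two lines shaped like the sample output
printf '%s\n' \
  '15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s' \
  'Pi is roughly 3.139492' | extract_pi
```

On a real run, the spark-submit output could be piped into the same function; whether the INFO lines arrive on stdout or stderr depends on the logging configuration, so a `2>&1` redirect may or may not be needed.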
Copy, paste and run the following code:
val data = Array(1, 2, 3, 4, 5) // create Array of Integers
val dataRDD = sc.parallelize(data) // create an RDD
val dataDF = dataRDD.toDF() // convert RDD to DataFrame
dataDF.write.parquet("data.parquet") // write to parquet
val newDataDF = sqlContext.read.parquet("data.parquet") // read back parquet to DF
newDataDF.show() // show contents
If you run this code in a Zeppelin notebook you will see the following output:
data: Array[Int] = Array(1, 2, 3, 4, 5)
dataRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:31
dataDF: org.apache.spark.sql.DataFrame = [_1: int]
newDataDF: org.apache.spark.sql.DataFrame = [_1: int]
+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the Python interpreter works in Zeppelin.
Also, see where pyspark interpreter is used.
