Member since 08-13-2019 | 47 Posts | 39 Kudos Received | 3 Solutions
04-21-2017
06:20 PM
Does it work in Chrome or Firefox?
11-06-2016
06:27 PM
Hi Anindya, generally all the tutorials and corresponding labs are for the latest sandbox, currently HDP 2.5. Two options:
1. Download the latest HDP 2.5 sandbox.
2. Use a similar notebook in the main Zeppelin notebook list: Lab 201: Intro to Machine Learning with Spark.
You can find other Zeppelin notebooks here: https://github.com/hortonworks-gallery/zeppelin-notebooks/tree/master
In the future, each version of the Sandbox will have a corresponding branch. For example, there is an HDP 2.5 branch now, so there won't be compatibility issues down the road with newer versions of Zeppelin on older Sandboxes.
10-03-2016
06:01 PM
2 Kudos
Make sure you are running the latest HDP 2.5 Sandbox. I've just tested it and I had no "prefix not found" related issues.
06-12-2016
01:59 AM
1 Kudo
Updated tutorial:
1) using centos-release-scl
2) wget https://bootstrap.pypa.io/ez_setup.py
Thanks!
06-09-2016
08:03 PM
2 Kudos
Check out the latest blog post on the HBase connector: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
03-17-2016
02:21 AM
1 Kudo
Sridhar, as long as you're using Spark 1.6 I'd refer to https://spark.apache.org/docs/1.6.1/sql-programming-guide.html
03-16-2016
11:47 PM
1 Kudo
Hi Sridhar, can you post what version of Spark you are running and a link to the documentation you're referring to?
03-05-2016
12:32 AM
4 Kudos
Requirements:
An HDP 2.3.x cluster, either a multi-node cluster or a single-node HDP Sandbox.
Installing:
The Spark 1.6 Technical Preview is provided in RPM and DEB package formats. The following instructions assume RPM packaging.
Download the Spark 1.6 RPM repository:
wget -nv http://private-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.1-10/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo
For installing on Ubuntu, use the following repository list instead:
http://private-repo-1.hortonworks.com/HDP/ubuntu12/2.x/updates/2.3.4.1-10/hdp.list
Install the Spark Package:
Download the Spark 1.6 RPM (and pySpark, if desired) and set it up on your HDP 2.3 cluster:
yum install spark_2_3_4_1_10-master -y
If you want to use pySpark, install it as follows and make sure that Python is installed on all nodes:
yum install spark_2_3_4_1_10-python -y
The RPM installer will also download core Hadoop dependencies. It will create “spark” as an OS user, and it will create the /user/spark directory in HDFS.
Set JAVA_HOME and SPARK_HOME:
Make sure that you set JAVA_HOME before you launch the Spark Shell or Thrift server:
export JAVA_HOME=<path to JDK 1.8>
The Spark install creates the directory where Spark binaries are unpacked (/usr/hdp/2.3.4.1-10/spark). Set the SPARK_HOME variable to this directory:
export SPARK_HOME=/usr/hdp/2.3.4.1-10/spark/
Create hive-site in the Spark conf directory:
As user root, create the file $SPARK_HOME/conf/hive-site.xml. Edit the file to contain only the following configuration setting:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
Run the Spark Pi Example:
To test compute-intensive tasks in Spark, the Pi example calculates pi by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many of them fall within the unit circle. The fraction of points that land inside approximates pi/4, so multiplying it by 4 gives an estimate of pi.
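The dart-throwing estimate described above can be tried without a cluster. The following is a minimal plain-Scala sketch of the idea, not the SparkPi source: the object name PiSketch, the function estimatePi, the sample count, and the seed are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not part of Spark or HDP.
object PiSketch {
  // Sample `samples` points uniformly from the unit square (0,0)-(1,1)
  // and count those whose distance from the origin is at most 1,
  // i.e. those inside the quarter of the unit circle within the square.
  def estimatePi(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    // inside / samples approximates pi/4, so scale by 4.
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    val estimate = estimatePi(samples = 1000000, seed = 42L)
    println(s"Pi is roughly $estimate")
  }
}
```

The Spark example distributes the same counting step across executors and sums the partial counts with a reduce, which is why its log reports "reduce at SparkPi.scala".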
Change to your Spark directory and switch to the spark OS user:
cd $SPARK_HOME
su spark
Run the Spark Pi example in yarn-client mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Note: The Pi job should complete without any failure messages. It should produce output similar to the following. Note the value of pi near the end of the output.
15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s
Pi is roughly 3.139492
15/12/16 13:21:05 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
03-05-2016
12:32 AM
2 Kudos
Copy, paste and run the following code:
val data = Array(1, 2, 3, 4, 5) // create Array of Integers
val dataRDD = sc.parallelize(data) // create an RDD
val dataDF = dataRDD.toDF() // convert RDD to DataFrame
dataDF.write.parquet("data.parquet") // write to parquet
val newDataDF = sqlContext.read.parquet("data.parquet") // read back parquet to DF
newDataDF.show() // show contents
If you run this code in a Zeppelin notebook, you will see the following output:
data: Array[Int] = Array(1, 2, 3, 4, 5)
dataRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:31
dataDF: org.apache.spark.sql.DataFrame = [_1: int]
newDataDF: org.apache.spark.sql.DataFrame = [_1: int]
+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
03-03-2016
03:14 AM
Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the Python interpreter works in Zeppelin.
Also, see hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ where the pyspark interpreter is used.