Member since: 08-13-2019

47 Posts | 39 Kudos Received | 3 Solutions
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|   | 3083 | 11-06-2016 06:27 PM |
|   | 11323 | 10-03-2016 06:01 PM |
|   | 3406 | 03-17-2016 02:21 AM |
04-21-2017 06:20 PM

Does it work in Chrome or Firefox?
11-06-2016 06:27 PM

Hi Anindya, generally all the tutorials and corresponding labs are for the latest sandbox, currently HDP 2.5. Two options:

1. Download the latest HDP 2.5 sandbox.
2. Use a similar notebook in the main Zeppelin notebook list: Lab 201: Intro to Machine Learning with Spark.

You can find other Zeppelin notebooks here: https://github.com/hortonworks-gallery/zeppelin-notebooks/tree/master

In the future, each version of the Sandbox will have a corresponding branch. For example, there's an HDP 2.5 branch now, so there won't be compatibility issues down the road with newer versions of Zeppelin on older Sandboxes.
10-03-2016 06:01 PM
2 Kudos

Make sure you are running the latest HDP 2.5 Sandbox. I've just tested it and I had no "prefix not found" related issues.
06-12-2016 01:59 AM
1 Kudo

Updated the tutorial:

1) using centos-release-scl
2) wget https://bootstrap.pypa.io/ez_setup.py

Thanks!
06-09-2016 08:03 PM
2 Kudos

Check out the latest blog on the HBase connector: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
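For context, the connector described in that blog (the Hortonworks shc library) exposes HBase tables through a JSON catalog and the DataFrame reader/writer API. Below is a minimal sketch of what usage looks like, assuming shc is on the classpath; the `contacts` table, its columns, and the input DataFrame `df` are hypothetical:

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// catalog maps DataFrame columns to an HBase row key and column family/qualifiers
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"contacts"},
    |  "rowkey":"key",
    |  "columns":{
    |    "id":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "name":{"cf":"info", "col":"name", "type":"string"}
    |  }
    |}""".stripMargin

// write an existing DataFrame `df` (columns: id, name) to HBase,
// creating the table with 5 regions if it does not exist
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

// read it back as a DataFrame
val contactsDF = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
contactsDF.show()
```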
						
					
03-17-2016 02:21 AM
1 Kudo

Sridhar, as long as you're using Spark 1.6, I'd refer to https://spark.apache.org/docs/1.6.1/sql-programming-guide.html
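To illustrate the DataFrame style that guide covers, here is a minimal Spark 1.6 sketch (assuming a hypothetical people.json input file and the sqlContext available in the Spark shell or Zeppelin):

```scala
// load a JSON file into a DataFrame and query it both ways
val people = sqlContext.read.json("people.json")   // hypothetical input file
people.printSchema()

// DataFrame API
people.filter(people("age") > 21).select("name", "age").show()

// or plain SQL against a temp table
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```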
						
					
03-16-2016 11:47 PM
1 Kudo

Hi Sridhar, can you post what version of Spark you are running and a link to the documentation you're referring to?
03-05-2016 12:32 AM
4 Kudos

Requirements

HDP 2.3.x cluster, whether it is a multi-node cluster or a single-node HDP Sandbox.

Installing

The Spark 1.6 Technical Preview is provided in RPM and DEB package formats. The following instructions assume RPM packaging:

1. Download the Spark 1.6 RPM repository:

   wget -nv http://private-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.1-10/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo

   For installing on Ubuntu, use the following:

   http://private-repo-1.hortonworks.com/HDP/ubuntu12/2.x/updates/2.3.4.1-10/hdp.list

2. Install the Spark package:

   Download the Spark 1.6 RPM (and pySpark, if desired) and set it up on your HDP 2.3 cluster:

   yum install spark_2_3_4_1_10-master -y

   If you want to use pySpark, install it as follows and make sure that Python is installed on all nodes:

   yum install spark_2_3_4_1_10-python -y

   The RPM installer will also download core Hadoop dependencies. It will create "spark" as an OS user, and it will create the /user/spark directory in HDFS.

3. Set JAVA_HOME and SPARK_HOME:

   Make sure that you set JAVA_HOME before you launch the Spark shell or thrift server:

   export JAVA_HOME=<path to JDK 1.8>

   The Spark install creates the directory where Spark binaries are unpacked (/usr/hdp/2.3.4.1-10/spark). Set the SPARK_HOME variable to this directory:

   export SPARK_HOME=/usr/hdp/2.3.4.1-10/spark/

4. Create hive-site.xml in the Spark conf directory:

   As user root, create the file SPARK_HOME/conf/hive-site.xml. Edit the file to contain only the following configuration setting, making sure that <value> points to the Hive Metastore URI in your cluster:

   <configuration>
     <property>
       <name>hive.metastore.uris</name>
       <value>thrift://sandbox.hortonworks.com:9083</value>
       <description>URI for client to contact metastore server</description>
     </property>
   </configuration>

Run the Spark Pi Example

To test compute-intensive tasks in Spark, the Pi example calculates pi by "throwing darts" at a circle: it generates points in the unit square ((0,0) to (1,1)) and counts how many fall within the unit circle. That fraction approximates pi/4, which is then used to estimate pi.

1. Change to your Spark directory and switch to the spark OS user:

   cd $SPARK_HOME
   su spark

2. Run the Spark Pi example in yarn-client mode:

   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

   Note: The Pi job should complete without any failure messages and produce output similar to the following. Note the value of pi near the end of the output.

   15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s
   Pi is roughly 3.139492
   15/12/16 13:21:05 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
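For reference, the "throwing darts" estimate described above boils down to a few lines of Spark code. This is a minimal sketch of the idea (not the exact SparkPi source), runnable from the Spark shell where `sc` is already defined:

```scala
import scala.math.random

// Sample n points in the unit square; the fraction landing inside the
// unit circle approximates pi/4 (area of the quarter circle over the square).
val n = 1000000
val inside = sc.parallelize(1 to n, 10).map { _ =>
  val x = random            // x, y uniform in [0, 1)
  val y = random
  if (x * x + y * y <= 1.0) 1 else 0
}.reduce(_ + _)

println(s"Pi is roughly ${4.0 * inside / n}")
```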
						
					
03-05-2016 12:32 AM
2 Kudos

Copy, paste and run the following code:

   val data = Array(1, 2, 3, 4, 5)                     // create Array of Integers
   val dataRDD = sc.parallelize(data)                  // create an RDD
   val dataDF = dataRDD.toDF()                         // convert RDD to DataFrame
   dataDF.write.parquet("data.parquet")                // write to parquet
   val newDataDF = sqlContext.
                   read.parquet("data.parquet")        // read back parquet to DF
   newDataDF.show()                                    // show contents

If you run this code in a Zeppelin notebook, you will see the following output:

   data: Array[Int] = Array(1, 2, 3, 4, 5)
   dataRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:31
   dataDF: org.apache.spark.sql.DataFrame = [_1: int]
   newDataDF: org.apache.spark.sql.DataFrame = [_1: int]
   +---+
   | _1|
   +---+
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   +---+
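As a small follow-up (a sketch under the same Spark 1.6 / Zeppelin assumptions, with a hypothetical temp table name), the parquet-backed DataFrame can also be queried with SQL:

```scala
// register the DataFrame as a temporary table and query it with SQL
newDataDF.registerTempTable("numbers")   // "numbers" is an arbitrary name
sqlContext.sql("SELECT `_1` FROM numbers WHERE `_1` > 2").show()
```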
						
					
03-03-2016 03:14 AM

Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the Python interpreter works in Zeppelin.
Also, see hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ where the pyspark interpreter is used.