Created 09-19-2016 12:35 PM
Hi,
I want to create a Hive table and load data into it through Spark SQL using Scala code (I have to build a jar and execute it through spark-submit). Please help me, I would be very thankful.
Created 09-19-2016 04:24 PM
This Scala code will create a DataFrame and load it into Hive. Hope this helps!
// Create dummy data and load it into a DataFrame
case class rowschema(id: Int, record: String)

val df = sqlContext.createDataFrame(Seq(rowschema(1, "record1"), rowschema(2, "record2"), rowschema(3, "record3")))
df.registerTempTable("tempTable")

// Create a new Hive table and load it from tempTable
sqlContext.sql("create table newHiveTable as select * from tempTable")
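Since the question asks for a jar run through spark-submit, here is a minimal sketch of the same logic as a standalone application. It assumes Spark 1.x with Hive support on the classpath; the object name CreateHiveTable and the app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Top-level case class so Spark can derive the DataFrame schema
case class RowSchema(id: Int, record: String)

object CreateHiveTable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CreateHiveTable")
    val sc = new SparkContext(conf)
    // A HiveContext (not a plain SQLContext) is needed to create real Hive tables
    val sqlContext = new HiveContext(sc)

    val df = sqlContext.createDataFrame(Seq(
      RowSchema(1, "record1"),
      RowSchema(2, "record2"),
      RowSchema(3, "record3")))

    df.registerTempTable("tempTable")
    sqlContext.sql("create table newHiveTable as select * from tempTable")

    sc.stop()
  }
}

Build the jar with sbt or maven and run it with something like spark-submit --class CreateHiveTable your-app.jar (the jar name is a placeholder).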
Created 02-22-2018 02:36 PM
This worked well after adding a semicolon:
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));
Created 02-22-2018 03:49 PM
It looks like tempTable is not sticking around:
scala> case class rowschema(id:Int, record:String){
| val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));{
| df.registerTempTable("tempTable")
| }
| }
defined class rowschema
scala> sqlContext.sql("create table newHiveTable as select * from tempTable")
org.apache.spark.sql.AnalysisException: Table not found: tempTable; line 1 pos 43
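The extra braces are the likely culprit here: both the createDataFrame call and registerTempTable ended up inside the body of the case class, so they only run when an instance of rowschema is constructed, which never happens (note the REPL only prints "defined class rowschema"). A minimal corrected sequence, entered as separate top-level statements in the same spark-shell session:

case class rowschema(id: Int, record: String)

val df = sqlContext.createDataFrame(Seq(rowschema(1, "record1"), rowschema(2, "record2"), rowschema(3, "record3")))
df.registerTempTable("tempTable")

sqlContext.sql("create table newHiveTable as select * from tempTable")

Also keep in mind that temp tables are scoped to the SQLContext that registered them, so the CTAS has to run in the same session.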
Created 09-19-2016 04:29 PM
Try with Zeppelin:
https://community.hortonworks.com/questions/34894/apache-zeppelin-with-hive.html
Makes it really easy to try things out.
See this article: https://community.hortonworks.com/content/kbentry/34784/data-ingest-with-apache-zeppelin-apache-spar...
Created 09-19-2016 09:25 PM
Hi AnjiReddy
Like Dan said, using some type of notebook would probably make your life a lot easier. It will allow you to work with Spark interactively and see visual output as you go.
Check out :
- Zeppelin - https://zeppelin.apache.org/ Note: you can find a pre-installed version of Zeppelin in the HDP sandbox (if you don't want to go through the hassle of installing and configuring the tool)
- Jupyter - http://jupyter.org/
Note: when working with Spark in Jupyter, you'll probably want to install the Spark (Scala) kernel - https://github.com/ibm-et/spark-kernel
That said, you can make it work using only the spark-shell. I assume there is a CSV file already available in HDFS at /user/zeppelin/yahoo_stocks.csv.
CSV file:
date,open,high,low,close,volume,adj_close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98
2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49
2015-04-20,44.73,44.91,44.41,44.66,10052900,44.66
2015-04-17,45.30,45.44,44.25,44.45,13305700,44.45
2015-04-16,45.82,46.13,45.53,45.78,13800300,45.78
2015-04-15,45.46,45.83,45.23,45.73,15033500,45.73
2015-04-14,44.82,45.64,44.79,45.53,12365800,45.53
2015-04-13,45.25,45.59,44.72,44.77,8837300,44.77
2015-04-10,45.79,45.79,45.00,45.18,8436400,45.18
2015-04-09,45.70,46.17,45.16,45.63,13678000,45.63
2015-04-08,43.86,45.19,43.80,45.17,16071000,45.17
2015-04-07,43.79,44.22,43.56,43.61,11382000,43.61
2015-04-06,43.82,44.03,43.61,43.67,10717000,43.67
2015-04-02,44.24,44.36,43.68,44.15,12229400,44.15
2015-04-01,44.45,44.60,43.95,44.13,14722300,44.13
2015-03-31,44.82,45.20,44.42,44.44,10415500,44.44
2015-03-30,45.36,45.42,44.82,44.95,8884300,44.95
2015-03-27,45.20,45.67,45.01,45.10,20563500,45.10
2015-03-26,43.78,44.67,43.68,44.47,16162900,44.47
2015-03-25,44.59,44.93,44.13,44.20,14036900,44.20
2015-03-24,44.64,44.78,44.28,44.42,7559100,44.42
2015-03-23,45.25,45.54,44.71,44.72,8268800,44.72
2015-03-20,45.37,45.58,44.91,45.04,14194200,45.04
2015-03-19,44.90,45.45,44.81,44.98,14758000,44.98
2015-03-18,43.58,44.71,43.43,44.67,18919900,44.67
Launch Spark shell
Note: the spark-csv package makes reading CSV files a lot easier - https://github.com/databricks/spark-csv
spark-shell --master yarn --packages com.databricks:spark-csv_2.10:1.5.0
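The same --packages flag works with spark-submit, so once your code is packaged as a jar (as the original question asks) the dependency can be pulled in the same way. The class and jar names below are placeholders:

spark-submit --master yarn --packages com.databricks:spark-csv_2.10:1.5.0 --class YourAppClass your-app.jar

Make sure the artifact's Scala version (2.10 here) matches the Scala build of your Spark distribution.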
Code :
// Read the CSV file from HDFS into a DataFrame
val input_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv")

// Save to Hive (the Spark way)
input_df.write.saveAsTable("%s.%s".format("default", "stockPrice"))

// Save to Hive (the SQL way)
input_df.registerTempTable("stock_price")
sqlContext.sql("create table hivestockprice as select * from stock_price")

// You can read the data back from Hive in the same way
sqlContext.sql("select * from hivestockprice").show(10)
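One caveat: by default spark-csv loads every column as a string, so the Hive tables above will have string columns. If you want numeric types, you can let the parser infer them with the inferSchema option. A small variation on the read above, assuming the same file path (the table name stockPriceTyped is just a placeholder):

// Let spark-csv infer column types (doubles for prices, an integer type for volume)
val typed_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv")

// Verify the inferred schema before saving
typed_df.printSchema()
typed_df.write.saveAsTable("default.stockPriceTyped")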
The Zeppelin way:
Paragraph 1: Read data and save to Hive:
%spark
// read file
val input_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv")

// save to Hive
input_df.write.saveAsTable("%s.%s".format("default", "stockPrice"))
Paragraph 2: Read back from Hive using SQL
%sql
Select * from stockPrice