<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to Create/LOAD data into table through sparkQL with scala code only..? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159560#M121949</link>
    <description>&lt;P&gt;
This Scala code will create a DataFrame and load it into Hive. Hope this helps!&lt;/P&gt;&lt;PRE&gt;// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")

// Create a new Hive table from the contents of tempTable
sqlContext.sql("create table newHiveTable as select * from tempTable")&lt;/PRE&gt;</description>
    <pubDate>Mon, 19 Sep 2016 23:24:07 GMT</pubDate>
    <dc:creator>dzaratsian</dc:creator>
    <dc:date>2016-09-19T23:24:07Z</dc:date>
    <item>
      <title>How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159559#M121948</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I want to create a Hive table and load data into it through Spark SQL using Scala code (I have to build a jar and execute it through spark-submit). Please help me; any guidance would be much appreciated.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 19:35:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159559#M121948</guid>
      <dc:creator>ANJIREDDY</dc:creator>
      <dc:date>2016-09-19T19:35:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159560#M121949</link>
      <description>&lt;P&gt;
This Scala code will create a DataFrame and load it into Hive. Hope this helps!&lt;/P&gt;&lt;PRE&gt;// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")

// Create a new Hive table from the contents of tempTable
sqlContext.sql("create table newHiveTable as select * from tempTable")&lt;/PRE&gt;</description>
      <pubDate>Mon, 19 Sep 2016 23:24:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159560#M121949</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-09-19T23:24:07Z</dc:date>
    </item>
    <item>
      <title>Re: How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159561#M121950</link>
      <description>&lt;P&gt;Try with Zeppelin:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/34894/apache-zeppelin-with-hive.html" target="_blank"&gt;https://community.hortonworks.com/questions/34894/apache-zeppelin-with-hive.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It makes it really easy to try things out.&lt;/P&gt;&lt;P&gt;See this article: &lt;A href="https://community.hortonworks.com/content/kbentry/34784/data-ingest-with-apache-zeppelin-apache-spark-16-h.html" target="_blank"&gt;https://community.hortonworks.com/content/kbentry/34784/data-ingest-with-apache-zeppelin-apache-spark-16-h.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Sep 2016 23:29:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159561#M121950</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-09-19T23:29:59Z</dc:date>
    </item>
    <item>
      <title>Re: How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159562#M121951</link>
      <description>&lt;P&gt;Hi &lt;A href="https://community.hortonworks.com/users/11452/reddyam3.html"&gt;AnjiReddy&lt;/A&gt;&lt;/P&gt;&lt;P&gt;
Like Dan said, using some type of notebook would probably make your life a lot easier.
It will allow you to work with Spark interactively, with visual outputs to work from. &lt;/P&gt;&lt;P&gt;
Check out: &lt;/P&gt;&lt;P&gt;
- Zeppelin - &lt;A href="https://zeppelin.apache.org/"&gt;https://zeppelin.apache.org/ &lt;/A&gt;
Note: you can find a pre-installed version of Zeppelin in the HDP sandbox (if you don't want to go through the hassle of installing and configuring the tool). &lt;/P&gt;&lt;P&gt;- Jupyter - &lt;A href="http://jupyter.org/"&gt;http://jupyter.org/ &lt;/A&gt;&lt;/P&gt;&lt;P&gt;   Note: when working with Spark, you'll probably want to install the Spark (Scala) kernel &lt;A href="https://github.com/ibm-et/spark-kernel"&gt;https://github.com/ibm-et/spark-kernel&lt;/A&gt; &lt;/P&gt;&lt;P&gt;That said, you can make it work using only the Spark shell.
I assume that there is a file already available in HDFS at /user/zeppelin/yahoo_stocks.csv &lt;/P&gt;&lt;P&gt;CSV file: &lt;/P&gt;&lt;PRE&gt;date,open,high,low,close,volume,adj_close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98
2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49
2015-04-20,44.73,44.91,44.41,44.66,10052900,44.66
2015-04-17,45.30,45.44,44.25,44.45,13305700,44.45
2015-04-16,45.82,46.13,45.53,45.78,13800300,45.78
2015-04-15,45.46,45.83,45.23,45.73,15033500,45.73
2015-04-14,44.82,45.64,44.79,45.53,12365800,45.53
2015-04-13,45.25,45.59,44.72,44.77,8837300,44.77
2015-04-10,45.79,45.79,45.00,45.18,8436400,45.18
2015-04-09,45.70,46.17,45.16,45.63,13678000,45.63
2015-04-08,43.86,45.19,43.80,45.17,16071000,45.17
2015-04-07,43.79,44.22,43.56,43.61,11382000,43.61
2015-04-06,43.82,44.03,43.61,43.67,10717000,43.67
2015-04-02,44.24,44.36,43.68,44.15,12229400,44.15
2015-04-01,44.45,44.60,43.95,44.13,14722300,44.13
2015-03-31,44.82,45.20,44.42,44.44,10415500,44.44
2015-03-30,45.36,45.42,44.82,44.95,8884300,44.95
2015-03-27,45.20,45.67,45.01,45.10,20563500,45.10
2015-03-26,43.78,44.67,43.68,44.47,16162900,44.47
2015-03-25,44.59,44.93,44.13,44.20,14036900,44.20
2015-03-24,44.64,44.78,44.28,44.42,7559100,44.42
2015-03-23,45.25,45.54,44.71,44.72,8268800,44.72
2015-03-20,45.37,45.58,44.91,45.04,14194200,45.04
2015-03-19,44.90,45.45,44.81,44.98,14758000,44.98
2015-03-18,43.58,44.71,43.43,44.67,18919900,44.67


&lt;/PRE&gt;&lt;P&gt;Launch the Spark shell. &lt;/P&gt;&lt;P&gt;Note: the spark-csv package makes reading CSV files a lot easier - &lt;A href="https://github.com/databricks/spark-csv" target="_blank"&gt;https://github.com/databricks/spark-csv&lt;/A&gt; &lt;/P&gt;&lt;PRE&gt;spark-shell --master yarn --packages com.databricks:spark-csv_2.10:1.5.0 &lt;/PRE&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;PRE&gt;// create a DataFrame from the file
val input_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv")

// save the DataFrame to Hive (the Spark way)
input_df.write.saveAsTable("%s.%s".format("default", "stockPrice"))

// save it the SQL way
input_df.registerTempTable("stock_price")
sqlContext.sql("create table hivestockprice as select * from stock_price")

// You can read the data back from Hive in the same way
sqlContext.sql("Select * from hivestockprice").show(10)
&lt;/PRE&gt;&lt;P&gt;The Zeppelin way: &lt;/P&gt;&lt;P&gt;Paragraph 1: Read data and save to Hive: &lt;/P&gt;&lt;PRE&gt;%spark
// read the file
val input_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv")

// save to Hive
input_df.write.saveAsTable("%s.%s".format("default", "stockPrice"))
&lt;/PRE&gt;&lt;P&gt;Paragraph 2: Read back from Hive using SQL&lt;/P&gt;&lt;PRE&gt;%sql
Select * from stockPrice &lt;/PRE&gt;</description>
      <pubDate>Tue, 20 Sep 2016 04:25:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159562#M121951</guid>
      <dc:creator>mlamairesse</dc:creator>
      <dc:date>2016-09-20T04:25:44Z</dc:date>
    </item>
    <item>
      <title>Re: How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159563#M121952</link>
      <description>&lt;P&gt;This worked well after adding a semicolon: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));&lt;/P&gt;</description>
      <pubDate>Thu, 22 Feb 2018 22:36:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159563#M121952</guid>
      <dc:creator>Hagedorn</dc:creator>
      <dc:date>2018-02-22T22:36:50Z</dc:date>
    </item>
    <item>
      <title>Re: How to Create/LOAD data into table through sparkQL with scala code only..?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159564#M121953</link>
      <description>&lt;P&gt;It looks like tempTable is not sticking around:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;scala&amp;gt; case class rowschema(id:Int, record:String){&lt;BR /&gt;  | val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));{&lt;BR /&gt;  |  df.registerTempTable("tempTable")&lt;BR /&gt;  | }&lt;BR /&gt;  | }&lt;BR /&gt;defined class rowschema&lt;/P&gt;&lt;P&gt;scala&amp;gt; sqlContext.sql("create table newHiveTable as select * from tempTable")&lt;BR /&gt;org.apache.spark.sql.AnalysisException:&lt;STRONG&gt; Table not found: tempTable&lt;/STRONG&gt;; line 1 pos 43&lt;/P&gt;</description>
      <pubDate>Thu, 22 Feb 2018 23:49:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-Create-LOAD-data-into-table-through-sparkQL-with/m-p/159564#M121953</guid>
      <dc:creator>Hagedorn</dc:creator>
      <dc:date>2018-02-22T23:49:43Z</dc:date>
    </item>
  </channel>
</rss>

