Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to create and load data into a Hive table through Spark SQL with Scala code only?

avatar
Rising Star

Hi,

I want to create and load data into a Hive table through Spark SQL using Scala code (I have to build a jar and execute it through spark-submit). Any help would be much appreciated.

1 ACCEPTED SOLUTION

avatar

This Scala code will create a DataFrame and load it into Hive. Hope this helps!

// Create dummy data and load it into a DataFrame
case class rowschema(id:Int, record:String)
val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")))
df.registerTempTable("tempTable")

// Create new Hive Table and load tempTable
sqlContext.sql("create table newHiveTable as select * from tempTable")
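Since the original question asks about building a jar and running it through spark-submit rather than typing into the shell, the same logic can be packaged as a standalone application. A minimal sketch against the Spark 1.x API used above (the object name HiveLoadApp and the jar name are hypothetical; outside the shell there is no ready-made sqlContext, so a HiveContext must be created explicitly):

```scala
// Hypothetical standalone version of the snippet above; build it into a jar
// and run it with, e.g.:
//   spark-submit --class HiveLoadApp myapp.jar
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveLoadApp {
  case class rowschema(id: Int, record: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveLoadApp"))
    // A HiveContext is needed so the new table lands in the Hive metastore
    val sqlContext = new HiveContext(sc)

    // Same steps as in the shell session above
    val df = sqlContext.createDataFrame(Seq(rowschema(1, "record1"), rowschema(2, "record2"), rowschema(3, "record3")))
    df.registerTempTable("tempTable")
    sqlContext.sql("create table newHiveTable as select * from tempTable")

    sc.stop()
  }
}
```

The build needs the spark-hive dependency on the compile classpath; at runtime, spark-submit on the cluster supplies the Spark and Hive jars.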


5 REPLIES


avatar
Explorer

This worked well after adding a semicolon:

val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));

avatar
Explorer

It looks like tempTable is not sticking around:

scala> case class rowschema(id:Int, record:String){
| val df = sqlContext.createDataFrame(Seq(rowschema(1,"record1"), rowschema(2,"record2"), rowschema(3,"record3")));{
| df.registerTempTable("tempTable")
| }
| }
defined class rowschema

scala> sqlContext.sql("create table newHiveTable as select * from tempTable")
org.apache.spark.sql.AnalysisException: Table not found: tempTable; line 1 pos 43
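A note on the error above: the transcript pastes the createDataFrame and registerTempTable calls inside the body of the case class, so they would only run if an instance of rowschema were ever constructed; the temp table is therefore never registered, hence "Table not found". Entered as separate top-level statements, the same code behaves as in the accepted answer (a sketch, assuming the shell's pre-built sqlContext):

```scala
// 1. Define the schema on its own, with no body
case class rowschema(id: Int, record: String)

// 2. Build the DataFrame and register the temp table as top-level statements
val df = sqlContext.createDataFrame(Seq(rowschema(1, "record1"), rowschema(2, "record2"), rowschema(3, "record3")))
df.registerTempTable("tempTable")

// 3. tempTable is now visible to SQL
sqlContext.sql("create table newHiveTable as select * from tempTable")
```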

avatar
Master Guru

avatar
Contributor

Hi AnjiReddy

Like Dan said, using some type of notebook would probably make your life a lot easier. It will let you work with Spark interactively and visualize the output as you go.

Check out :

- Zeppelin - https://zeppelin.apache.org/ Note: you can find a pre-installed version of Zeppelin in the HDP sandbox (if you don't want to go through the hassle of installing and configuring the tool)

- Jupyter - http://jupyter.org/

Note: when working with Spark, you'll probably want to install the Spark (Scala) kernel https://github.com/ibm-et/spark-kernel

That said, you can make it work using only the spark shell. I assume there is a CSV file already available in HDFS at /user/zeppelin/yahoo_stocks.csv

CSV file:

date,open,high,low,close,volume,adj_close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98
2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49
2015-04-20,44.73,44.91,44.41,44.66,10052900,44.66
2015-04-17,45.30,45.44,44.25,44.45,13305700,44.45
2015-04-16,45.82,46.13,45.53,45.78,13800300,45.78
2015-04-15,45.46,45.83,45.23,45.73,15033500,45.73
2015-04-14,44.82,45.64,44.79,45.53,12365800,45.53
2015-04-13,45.25,45.59,44.72,44.77,8837300,44.77
2015-04-10,45.79,45.79,45.00,45.18,8436400,45.18
2015-04-09,45.70,46.17,45.16,45.63,13678000,45.63
2015-04-08,43.86,45.19,43.80,45.17,16071000,45.17
2015-04-07,43.79,44.22,43.56,43.61,11382000,43.61
2015-04-06,43.82,44.03,43.61,43.67,10717000,43.67
2015-04-02,44.24,44.36,43.68,44.15,12229400,44.15
2015-04-01,44.45,44.60,43.95,44.13,14722300,44.13
2015-03-31,44.82,45.20,44.42,44.44,10415500,44.44
2015-03-30,45.36,45.42,44.82,44.95,8884300,44.95
2015-03-27,45.20,45.67,45.01,45.10,20563500,45.10
2015-03-26,43.78,44.67,43.68,44.47,16162900,44.47
2015-03-25,44.59,44.93,44.13,44.20,14036900,44.20
2015-03-24,44.64,44.78,44.28,44.42,7559100,44.42
2015-03-23,45.25,45.54,44.71,44.72,8268800,44.72
2015-03-20,45.37,45.58,44.91,45.04,14194200,45.04
2015-03-19,44.90,45.45,44.81,44.98,14758000,44.98
2015-03-18,43.58,44.71,43.43,44.67,18919900,44.67


Launch Spark shell

Note: the spark-csv package makes reading CSV files a lot easier - https://github.com/databricks/spark-csv

spark-shell --master yarn --packages com.databricks:spark-csv_2.10:1.5.0 

Code :

// read the CSV file into a DataFrame
val input_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv") 

// save file to hive (the spark way) 
input_df.write.saveAsTable("%s.%s".format( "default" , "stockPrice" ))

// save the file the SQL way
input_df.registerTempTable("stock_price")
sqlContext.sql("create table hivestockprice as select * from stock_price")

// You can read back the data from Hive in the same way
sqlContext.sql("Select * from hivestockprice").show(10)

The Zeppelin way:

Paragraph 1: Read data and save to Hive:

%spark 
//read file
val input_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("hdfs://sandbox.hortonworks.com:8020/user/zeppelin/yahoo_stocks.csv") 

//save to Hive
input_df.write.saveAsTable("%s.%s".format( "default" , "stockPrice" )) 

Paragraph 2: Read back from Hive using SQL

%sql
Select * from stockPrice