How to insert data into Hive from SparkSQL
Labels: Apache Spark
Created on 11-16-2018 08:48 AM - edited 09-16-2022 06:54 AM
Below is my code
import sqlContext.implicits._
import org.apache.spark.sql
val eBayText = sc.textFile("/user/cloudera/spark/servicesDemo.csv")
val hospitalDataText = sc.textFile("/user/cloudera/spark/servicesDemo.csv")
val header = hospitalDataText.first()
val hospitalData = hospitalDataText.filter(a=>a!=header)
case class Services(uhid:String,locationid:String,doctorid:String)
val hData = hospitalData.map(_.split(",")).map(p=>Services(p(0),p(1),p(2)))
val hosService = hData.toDF()
hosService.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/hosdata")
This code created a 'hosdata' folder at the specified path, containing the data in Parquet format.
But when I went to Hive to check whether the table had been created, I was not able to see any table named 'hosdata'.
So I ran the commands below.
hosService.write.mode("overwrite").saveAsTable("hosData")
sqlContext.sql("show tables").show
which shows the following result:
+--------------------+-----------+
| tableName|isTemporary|
+--------------------+-----------+
| hosdata| false|
+--------------------+-----------+
But again, when I check in Hive, I cannot see the table 'hosdata'.
Could anyone let me know what step I am missing?
Created 11-20-2018 09:41 AM
You should create an external table in Hive and then issue a refresh command; after your Spark application finishes, you will see the new data in your table.
For creating an external table, see the Cloudera docs.
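To illustrate the suggestion above, an external table over the Parquet path might look like the following sketch. The location and column names are taken from the original post's code; adjust them to your environment:

```sql
-- Sketch: external table over the Parquet files Spark wrote.
-- Path and schema come from the code in the question.
CREATE EXTERNAL TABLE IF NOT EXISTS hosdata (
  uhid       STRING,
  locationid STRING,
  doctorid   STRING
)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/hosdata';
```

Because the table is external, Hive only tracks metadata; the files Spark appends to that location remain in place and become queryable once Hive's metadata is refreshed.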
Created 11-20-2018 08:38 PM
Thank you for your reply.
May I know what the refresh command is?
And can I see the table in Hive only after I close the Spark application?
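For context, the exact refresh command depends on which engine is reading the table; the following is a hedged sketch using the table name from this thread:

```sql
-- In Hive, new partition directories under an external table's
-- location are picked up with:
MSCK REPAIR TABLE hosdata;

-- In Impala, metadata for newly added files is reloaded with:
REFRESH hosdata;

-- In Spark SQL, cached file listings for a table are invalidated with:
REFRESH TABLE hosdata;
```

For an unpartitioned external table queried from Hive itself, new files at the table location are generally visible on the next query without any refresh.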
Created 11-21-2018 09:48 AM
Thank You. This works for me. 🙂
Created 11-21-2018 09:51 AM
Created 01-26-2019 08:07 AM
Another approach to inserting the data, which we follow in our project, is not to insert the data into Hive directly from Spark but instead to do the following:
1. Read the input CSV file in Spark and transform the data according to the requirements.
2. Save the data back into an output CSV file in HDFS.
3. Push the data from the output CSV into Hive using the hive -f or hive -e command from the shell.
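As a sketch of step 3, a script file run with `hive -f load_services.hql` might contain something like the following. The table name, HDFS path, and file name here are illustrative assumptions, not taken from the thread:

```sql
-- load_services.hql (illustrative; names and paths are assumptions)
CREATE TABLE IF NOT EXISTS services_staging (
  uhid       STRING,
  locationid STRING,
  doctorid   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move the Spark output file from HDFS into the staging table.
LOAD DATA INPATH '/user/cloudera/output/servicesOut.csv'
INTO TABLE services_staging;
```

Note that LOAD DATA with INPATH moves (rather than copies) the file into the table's location, so the source path will no longer contain it afterwards.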
