How to use Spark to load data into a Hive partition?
Labels: Apache Spark
Created on 06-15-2017 03:10 AM - edited 09-16-2022 04:45 AM
I'm working on loading data into a Hive table using Spark, and I am able to do it successfully. Below is the code I have written to load the data into Hive.
Input data (file name: emp):
1	Mark	1000	HR
2	Peter	1200	SALES
3	Henry	1500	HR
4	Adam	2000	IT
5	Steve	2500	IT
6	Brian	2700	IT
7	Michael	3000	HR
8	Steve	10000	SALES
9	Peter	7000	HR
10	Dan	6000	BS
Steps I performed to load the data:
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split("\t"))   // split each tab-delimited line into fields

case class employee(id: Int, name: String, salary: Int, dept: String)

val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()                       // toDF() works in spark-shell via the pre-imported implicits

empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Columns in my Hive table:
col_name    data_type
id          int
name        string
salary      int
dept        string
If I have a Hive table that is partitioned by "location":
create table emptab(id int, name String, salary Int, dept String)
partitioned by (location String)
row format delimited
fields terminated by '\t'
stored as parquet;
Could anyone tell me how I can load the data into a particular partition of a Hive table?
What changes do I have to make to this statement:
empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Also, is the above way of loading the data the right approach, or is there a standard way to do this operation?
Created 06-15-2017 12:21 PM
Make sure that the location value is in the last column in the DF.
Then add .partitionBy("location") to your DF write statement.
empDF.write.partitionBy("location").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
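A minimal end-to-end sketch of that, assuming the source data has no location field and a single, made-up partition value ("NY") is added as a literal column:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

// "NY" is a hypothetical partition value; empDF is the DataFrame built above
val empWithLoc = empDF.withColumn("location", lit("NY"))

empWithLoc.write
  .partitionBy("location")
  .format("parquet")
  .mode(SaveMode.Append)
  .save("/user/hive/warehouse/emptab")

partitionBy writes the rows into a location=NY subdirectory under the table path and leaves the location column out of the Parquet data files themselves.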
Created on 06-15-2017 09:45 PM - edited 06-15-2017 09:46 PM
But where can I give the partition (location) value?
Also, my case class doesn't have a 'location' column. When I tried your answer, it said "Matching column not found".
Could you let me know how I can correct it?
Created 06-16-2017 10:14 AM
The error is because it is now looking for the location column in the DF and it doesn't exist. Add the column to your case class and DF; then it should be good.
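For example, a sketch of that change, assuming the input file gains a fifth tab-separated location field (the values themselves are made up):

// case class extended with the partition column
case class employee(id: Int, name: String, salary: Int, dept: String, location: String)

// e(4) is the assumed new location field in the input file
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3), e(4)))
val empDF = empRDD.toDF()

empDF.write
  .partitionBy("location")
  .format("parquet")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("/user/hive/warehouse/emptab")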
Created 06-16-2017 10:16 AM
If the location value is the same for all of the data in the DF, then you may be better served by loading it statically: create the subfolder for the location value under the table's path and then write the DF out to that location.
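A sketch of that static approach, again with "NY" as a made-up location value; the ALTER TABLE statement (run in Hive) then registers the new directory as a partition in the metastore:

// write the DataFrame (no location column needed) directly into the partition's folder
empDF.write
  .format("parquet")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("/user/hive/warehouse/emptab/location=NY")

Then, in Hive:
ALTER TABLE emptab ADD IF NOT EXISTS PARTITION (location='NY');
(or MSCK REPAIR TABLE emptab; to pick up all new partition directories)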
Created on 06-20-2017 08:40 PM - edited 06-20-2017 11:16 PM
The partition column is a logical entity related to the table; it cannot be present in the data columns or the schema of the Hive table. So how can I put it in the DataFrame and the case class?
