Created on 06-15-2017 03:10 AM - edited 09-16-2022 04:45 AM
Im working on loading data into a Hive table using Spark. I am able to do it successfully. Below is the code that I have written to load the data into Hive.
input data: file-name: emp
1 Mark 1000 HR 2 Peter 1200 SALES 3 Henry 1500 HR 4 Adam 2000 IT 5 Steve 2500 IT 6 Brian 2700 IT 7 Michael 3000 HR 8 Steve 10000 SALES 9 Peter 7000 HR 10 Dan 6000 BS
Steps I performed to load the data:
val empFile = sc.textFile("emp") val empData = empFile.map(e => e.split("\t") case class employee(id:Int, name:String, salary:Int, dept:String) val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3))) val empDF = empRDD.toDF() empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Columns in my hive Table:
col_name data_type id int name string salary int dept string
If I have a Hive table which is partitioned based on "location"
> create table emptab(id int, name String, salary Int, dept String) > partitioned by (location String) > row format delimited > fields terminated by '\t' > stored as parquet
Could anyone tell me how can I load the data into a particular partition of a Hive table ?
What changes do I have to do in this statement:
empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Also is the above way of loading the data, the right format or do we follow any standards to do that operation ?
Created 06-15-2017 12:21 PM
Make sure that the location value is in the last column in the DF.
Then add the .partitionBy('location') to your DF write statement.
empDF.write.partitionBy('location').format('parquet').mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Created on 06-15-2017 09:45 PM - edited 06-15-2017 09:46 PM
But where can I give the partition(location's) value ?
Also my case class doesn't have a column 'location'. When I tried your answer, it says "Matching column not found".
Could you let me know how can I correct it ?
Created 06-16-2017 10:14 AM
Created 06-16-2017 10:16 AM
Created on 06-20-2017 08:40 PM - edited 06-20-2017 11:16 PM
Partition column is a logical entity related to the table. It cannot be present in the data or schema of the Hive table. So how can I put the data in DataFrame and a case class ?