Im working on loading data into a Hive table using Spark. I am able to do it successfully. Below is the code that I have written to load the data into Hive.
input data: file-name: emp
1 Mark 1000 HR 2 Peter 1200 SALES 3 Henry 1500 HR 4 Adam 2000 IT 5 Steve 2500 IT 6 Brian 2700 IT 7 Michael 3000 HR 8 Steve 10000 SALES 9 Peter 7000 HR 10 Dan 6000 BS
Steps I performed to load the data:
val empFile = sc.textFile("emp") val empData = empFile.map(e => e.split("\t") case class employee(id:Int, name:String, salary:Int, dept:String) val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3))) val empDF = empRDD.toDF() empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
Columns in my hive Table:
col_name data_type id int name string salary int dept string
If I have a Hive table which is partitioned based on "location"
> create table emptab(id int, name String, salary Int, dept String) > partitioned by (location String) > row format delimited > fields terminated by '\t' > stored as parquet
Could anyone tell me how can I load the data into a particular partition of a Hive table ?
What changes do I have to do in this statement:
Also is the above way of loading the data, the right format or do we follow any standards to do that operation ?
Make sure that the location value is in the last column in the DF.
Then add the .partitionBy('location') to your DF write statement.
But where can I give the partition(location's) value ?
Also my case class doesn't have a column 'location'. When I tried your answer, it says "Matching column not found".
Could you let me know how can I correct it ?
Partition column is a logical entity related to the table. It cannot be present in the data or schema of the Hive table. So how can I put the data in DataFrame and a case class ?