Support Questions

Sidhartha · ‎06-15-2017

Im working on loading data into a Hive table using Spark. I am able to do it successfully. Below is the code that I have written to load the data into Hive.

input data: file-name: emp

1   Mark    1000    HR
2   Peter   1200    SALES
3   Henry   1500    HR
4   Adam    2000    IT
5   Steve   2500    IT
6   Brian   2700    IT
7   Michael 3000    HR
8   Steve   10000   SALES
9   Peter   7000    HR
10  Dan     6000    BS

Steps I performed to load the data:

val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split("\t")
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")

Columns in my hive Table:

col_name    data_type
id          int
name        string
salary      int
dept        string

If I have a Hive table which is partitioned based on "location"

> create table emptab(id int, name String, salary Int, dept String)
> partitioned by (location String)
> row format delimited
> fields terminated by '\t'
> stored as parquet

Could anyone tell me how can I load the data into a particular partition of a Hive table ?

What changes do I have to do in this statement:

empDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")

Also is the above way of loading the data, the right format or do we follow any standards to do that operation ?

mbigelow · ‎06-15-2017

Make sure that the location value is in the last column in the DF.

Then add the .partitionBy('location') to your DF write statement.

empDF.write.partitionBy('location').format('parquet').mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")

Sidhartha · ‎06-15-2017

But where can I give the partition(location's) value ?
Also my case class doesn't have a column 'location'. When I tried your answer, it says "Matching column not found".

Could you let me know how can I correct it ?

mbigelow · ‎06-16-2017

Your case class must include it so that it maps to the DF correctly.

The error is because it is now looking for the location columns in the DF and it doesn't exist. Make the change to your class and DF; then it should be good.

mbigelow · ‎06-16-2017

A way to think about it is "How will it know the location value?" It won't. You must tell it what the value is and per row so it can partition it dynamically as it is loading the data.

If the location value is the same for all of the data in the DF, then you may be better served by loading it statically. In which case, create the subfolder for the location value under the table's path and then write the DF out to that location.

Sidhartha · ‎06-20-2017

Partition column is a logical entity related to the table. It cannot be present in the data or schema of the Hive table. So how can I put the data in DataFrame and a case class ?

Cloudera Community

Support Questions

How to use spark to load data into a Hive partition ?