Posts: 26
Registered: ‎11-04-2016
Accepted Solution

Saving Spark 2.2 dataframs in Hive table

[ Edited ]



I have a problem with Spark 2.2 (latest CDH 5.12.0) and saving DataFrame into Hive table.


Things I can do:


1. I can easily read tables from Hive tables in Spark 2.2

2. I can do saveAsTable in Spark 1.6 into Hive table and read it from Spark 2.2

3. I can do write.saveAsTable in Spark 2.2 and see the files and data inside Hive table



Things I cannot do in Spark 2.2:


4. When I read Hive table saved by Spark 2.2 in spark2-shell, it shows empty rows. It has all the fields and schema but no data.


I don't understand what could cause this problem.

Any help would be appreciate it.




scala> val df = sc.parallelize(
     |   Seq(
     |     ("first", Array(2.0, 1.0, 2.1, 5.4)),
     |     ("test", Array(1.5, 0.5, 0.9, 3.7)),
     |     ("choose", Array(8.0, 2.9, 9.1, 2.5))
     |   ), 3
     | ).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

|    _1|                  _2|
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|

scala> df.write.saveAsTable("database.test")

scala> val savedDF = spark.sql("SELECT * FROM database.test")
res45: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> savedDF.count
res55: Long = 0





New Contributor
Posts: 1
Registered: ‎09-15-2017

Re: Saving Spark 2.2 dataframs in Hive table

I'm having the same problem under Cloudera CDH 5.12.1. I was previously using CDH 5.10.1 and upgraded in hope the error was resolved, but it persists in the latest version of CDH.


I have filed a bug in the Apache Spark Bugtracker describing the problem and a workaround (manually specifying path and saving as external table or manually updating Hive metastore data) here:


The problem seems to be that Spark does not write the path to Hive metastore. Any advice how to fix this?

Posts: 26
Registered: ‎11-04-2016

Re: Saving Spark 2.2 dataframs in Hive table



Sorry I forgot to come back here and say how I found a quick workaround.


So, here's how I do it:


import org.apache.spark.sql.DataFrameWriter

val options = Map("path" -> "this is the path to your warehouse") // for me every database has a different warehouse. I am not using the default warehouse. I am using users' directory for warehousing DBs and tables
//and simply write it!

So as you can see a simple path to the warehouse of the database will solve the problem. I want to say Spark 2 is not aware of these metadata, but when you look at your spark.catalog you can see everything is there! So I don't know why it can't decide where is the path to your database when you want to


Hope this helps :)