Reply
Explorer
Posts: 19
Registered: ‎11-04-2016
Accepted Solution

Saving Spark 2.2 dataframs in Hive table

[ Edited ]

Hello,

 

I have a problem with Spark 2.2 (latest CDH 5.12.0) and saving DataFrame into Hive table.

 

Things I can do:

 

1. I can easily read tables from Hive tables in Spark 2.2

2. I can do saveAsTable in Spark 1.6 into Hive table and read it from Spark 2.2

3. I can do write.saveAsTable in Spark 2.2 and see the files and data inside Hive table

 

 

Things I cannot do in Spark 2.2:

 

4. When I read Hive table saved by Spark 2.2 in spark2-shell, it shows empty rows. It has all the fields and schema but no data.

 

I don't understand what could cause this problem.

Any help would be appreciate it.

 

example:

 

scala> val df = sc.parallelize(
     |   Seq(
     |     ("first", Array(2.0, 1.0, 2.1, 5.4)),
     |     ("test", Array(1.5, 0.5, 0.9, 3.7)),
     |     ("choose", Array(8.0, 2.9, 9.1, 2.5))
     |   ), 3
     | ).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> df.show
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+

scala> df.write.saveAsTable("database.test")

scala> val savedDF = spark.sql("SELECT * FROM database.test")
res45: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> savedDF.show
+---+---+
|_1|_2|
+---+---+
+---+---+
scala> savedDF.count
res55: Long = 0

 

 

Thanks

 

New Contributor
Posts: 1
Registered: ‎09-15-2017

Re: Saving Spark 2.2 dataframs in Hive table

I'm having the same problem under Cloudera CDH 5.12.1. I was previously using CDH 5.10.1 and upgraded in hope the error was resolved, but it persists in the latest version of CDH.

 

I have filed a bug in the Apache Spark Bugtracker describing the problem and a workaround (manually specifying path and saving as external table or manually updating Hive metastore data) here: https://issues.apache.org/jira/browse/SPARK-21994

 

The problem seems to be that Spark does not write the path to Hive metastore. Any advice how to fix this?

Highlighted
Explorer
Posts: 19
Registered: ‎11-04-2016

Re: Saving Spark 2.2 dataframs in Hive table

Hi,

 

Sorry I forgot to come back here and say how I found a quick workaround.

 

So, here's how I do it:

 

import org.apache.spark.sql.DataFrameWriter

val options = Map("path" -> "this is the path to your warehouse") // for me every database has a different warehouse. I am not using the default warehouse. I am using users' directory for warehousing DBs and tables
//and simply write it!
df.write.options(options).saveAsTable("db_name.table_name")

So as you can see a simple path to the warehouse of the database will solve the problem. I want to say Spark 2 is not aware of these metadata, but when you look at your spark.catalog you can see everything is there! So I don't know why it can't decide where is the path to your database when you want to write.save.

 

Hope this helps :)

 

Announcements