Reply
Contributor
Posts: 31
Registered: ‎11-04-2016
Accepted Solution

Saving Spark 2.2 dataframs in Hive table

[ Edited ]

Hello,

 

I have a problem with Spark 2.2 (latest CDH 5.12.0) and saving DataFrame into Hive table.

 

Things I can do:

 

1. I can easily read tables from Hive tables in Spark 2.2

2. I can do saveAsTable in Spark 1.6 into Hive table and read it from Spark 2.2

3. I can do write.saveAsTable in Spark 2.2 and see the files and data inside Hive table

 

 

Things I cannot do in Spark 2.2:

 

4. When I read Hive table saved by Spark 2.2 in spark2-shell, it shows empty rows. It has all the fields and schema but no data.

 

I don't understand what could cause this problem.

Any help would be appreciate it.

 

example:

 

scala> val df = sc.parallelize(
     |   Seq(
     |     ("first", Array(2.0, 1.0, 2.1, 5.4)),
     |     ("test", Array(1.5, 0.5, 0.9, 3.7)),
     |     ("choose", Array(8.0, 2.9, 9.1, 2.5))
     |   ), 3
     | ).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> df.show
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+

scala> df.write.saveAsTable("database.test")

scala> val savedDF = spark.sql("SELECT * FROM database.test")
res45: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> savedDF.show
+---+---+
|_1|_2|
+---+---+
+---+---+
scala> savedDF.count
res55: Long = 0

 

 

Thanks

 

New Contributor
Posts: 1
Registered: ‎09-15-2017

Re: Saving Spark 2.2 dataframs in Hive table

I'm having the same problem under Cloudera CDH 5.12.1. I was previously using CDH 5.10.1 and upgraded in hope the error was resolved, but it persists in the latest version of CDH.

 

I have filed a bug in the Apache Spark Bugtracker describing the problem and a workaround (manually specifying path and saving as external table or manually updating Hive metastore data) here: https://issues.apache.org/jira/browse/SPARK-21994

 

The problem seems to be that Spark does not write the path to Hive metastore. Any advice how to fix this?

Highlighted
Contributor
Posts: 31
Registered: ‎11-04-2016

Re: Saving Spark 2.2 dataframs in Hive table

Hi,

 

Sorry I forgot to come back here and say how I found a quick workaround.

 

So, here's how I do it:

 

import org.apache.spark.sql.DataFrameWriter

val options = Map("path" -> "this is the path to your warehouse") // for me every database has a different warehouse. I am not using the default warehouse. I am using users' directory for warehousing DBs and tables
//and simply write it!
df.write.options(options).saveAsTable("db_name.table_name")

So as you can see a simple path to the warehouse of the database will solve the problem. I want to say Spark 2 is not aware of these metadata, but when you look at your spark.catalog you can see everything is there! So I don't know why it can't decide where is the path to your database when you want to write.save.

 

Hope this helps :)

 

New Contributor
Posts: 1
Registered: ‎03-06-2018

Re: Saving Spark 2.2 dataframs in Hive table

This workaround has a severe problem.

 

val options = Map("path" -> "this is the path to your warehouse")

Do NOT do this. When you specify the "path" as just the warehouse location, Spark will assume that is the location that needs to be purged during an overwrite. This can wipe everything in your warehouse. So, if you put "user/hive/warehouse" it will delete everything in "user/hive/warehouse". This is bad and should not have been marked as the accepted answer.

 

I think the only reason maziyar didn't have everything wiped is because he is using separate warehouses for each db...or he is actually specifying the full path to each table.

Contributor
Posts: 31
Registered: ‎11-04-2016

Re: Saving Spark 2.2 dataframs in Hive table

This is a good point to keep in mind when someone chooses this method since this is the only workaround for saving tables inside Hive in the older version of Cloudera.

 

NOTE: This problem only happened in Cloudera and has been solved in a newer version (Hive/Spark). As of CDH 5.14.0 this has been completely resolved.

Announcements