Support Questions

maziyar · ‎08-05-2017

Hello,

I have a problem with Spark 2.2 (latest CDH 5.12.0) and saving DataFrame into Hive table.

Things I can do:

1. I can easily read tables from Hive tables in Spark 2.2

2. I can do saveAsTable in Spark 1.6 into Hive table and read it from Spark 2.2

3. I can do write.saveAsTable in Spark 2.2 and see the files and data inside Hive table

Things I cannot do in Spark 2.2:

4. When I read Hive table saved by Spark 2.2 in spark2-shell, it shows empty rows. It has all the fields and schema but no data.

I don't understand what could cause this problem.

Any help would be appreciate it.

example:

scala> val df = sc.parallelize(
     |   Seq(
     |     ("first", Array(2.0, 1.0, 2.1, 5.4)),
     |     ("test", Array(1.5, 0.5, 0.9, 3.7)),
     |     ("choose", Array(8.0, 2.9, 9.1, 2.5))
     |   ), 3
     | ).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> df.show
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+

scala> df.write.saveAsTable("database.test")

scala> val savedDF = spark.sql("SELECT * FROM database.test")
res45: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> savedDF.show
+---+---+
|_1|_2|
+---+---+
+---+---+
scala> savedDF.count
res55: Long = 0

Thanks

maziyar · ‎09-16-2017

Hi,

Sorry I forgot to come back here and say how I found a quick workaround.

So, here's how I do it:

import org.apache.spark.sql.DataFrameWriter

val options = Map("path" -> "this is the path to your warehouse") // for me every database has a different warehouse. I am not using the default warehouse. I am using users' directory for warehousing DBs and tables
//and simply write it!
df.write.options(options).saveAsTable("db_name.table_name")

So as you can see a simple path to the warehouse of the database will solve the problem. I want to say Spark 2 is not aware of these metadata, but when you look at your spark.catalog you can see everything is there! So I don't know why it can't decide where is the path to your database when you want to write.save.

Hope this helps 🙂

View solution in original post

pederpansen · ‎09-15-2017

I'm having the same problem under Cloudera CDH 5.12.1. I was previously using CDH 5.10.1 and upgraded in hope the error was resolved, but it persists in the latest version of CDH.

I have filed a bug in the Apache Spark Bugtracker describing the problem and a workaround (manually specifying path and saving as external table or manually updating Hive metastore data) here: https://issues.apache.org/jira/browse/SPARK-21994

The problem seems to be that Spark does not write the path to Hive metastore. Any advice how to fix this?

maziyar · ‎09-16-2017

Hi,

Sorry I forgot to come back here and say how I found a quick workaround.

So, here's how I do it:

import org.apache.spark.sql.DataFrameWriter

val options = Map("path" -> "this is the path to your warehouse") // for me every database has a different warehouse. I am not using the default warehouse. I am using users' directory for warehousing DBs and tables
//and simply write it!
df.write.options(options).saveAsTable("db_name.table_name")

So as you can see a simple path to the warehouse of the database will solve the problem. I want to say Spark 2 is not aware of these metadata, but when you look at your spark.catalog you can see everything is there! So I don't know why it can't decide where is the path to your database when you want to write.save.

Hope this helps 🙂

aeastman · ‎03-06-2018

This workaround has a severe problem.

val options = Map("path" -> "this is the path to your warehouse")

Do NOT do this. When you specify the "path" as just the warehouse location, Spark will assume that is the location that needs to be purged during an overwrite. This can wipe everything in your warehouse. So, if you put "user/hive/warehouse" it will delete everything in "user/hive/warehouse". This is bad and should not have been marked as the accepted answer.

I think the only reason maziyar didn't have everything wiped is because he is using separate warehouses for each db...or he is actually specifying the full path to each table.

maziyar · ‎03-06-2018

This is a good point to keep in mind when someone chooses this method since this is the only workaround for saving tables inside Hive in the older version of Cloudera.

NOTE: This problem only happened in Cloudera and has been solved in a newer version (Hive/Spark). As of CDH 5.14.0 this has been completely resolved.

kiakaku_vietnam · ‎06-22-2018

Hi

I'm still get error in CDH 14 and spark2-shell

Cloudera Community

Support Questions

Saving Spark 2.2 dataframs in Hive table