Created on 08-05-2017 12:57 PM - edited 09-16-2022 05:03 AM
Hello,
I have a problem with Spark 2.2 (on the latest CDH 5.12.0) when saving a DataFrame into a Hive table.
Things I can do:
1. I can easily read Hive tables in Spark 2.2
2. I can saveAsTable into a Hive table in Spark 1.6 and read it from Spark 2.2
3. I can do write.saveAsTable in Spark 2.2 and see the files and data inside the Hive table's directory
Things I cannot do in Spark 2.2:
4. When I read a Hive table saved by Spark 2.2 in spark2-shell, it returns no rows. It has all the fields and the schema, but no data.
I don't understand what could cause this problem.
Any help would be appreciated.
Example:

scala> val df = sc.parallelize(
     |   Seq(
     |     ("first", Array(2.0, 1.0, 2.1, 5.4)),
     |     ("test", Array(1.5, 0.5, 0.9, 3.7)),
     |     ("choose", Array(8.0, 2.9, 9.1, 2.5))
     |   ), 3
     | ).toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> df.show
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|[2.0, 1.0, 2.1, 5.4]|
|  test|[1.5, 0.5, 0.9, 3.7]|
|choose|[8.0, 2.9, 9.1, 2.5]|
+------+--------------------+

scala> df.write.saveAsTable("database.test")

scala> val savedDF = spark.sql("SELECT * FROM database.test")
savedDF: org.apache.spark.sql.DataFrame = [_1: string, _2: array<double>]

scala> savedDF.show
+---+---+
| _1| _2|
+---+---+
+---+---+

scala> savedDF.count
res55: Long = 0
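For reference, one way to check where the table's data and metadata ended up (a sketch using the database.test name from the example above; the HDFS path assumes the default warehouse layout):

// Show the metadata Spark recorded for the table; the "Location" field is the
// directory that reads will scan. If it is wrong or missing, show() returns
// no rows even though the data files exist.
spark.sql("DESCRIBE FORMATTED database.test").show(100, false)

// Cross-check the files on HDFS (default warehouse layout assumed):
//   hdfs dfs -ls /user/hive/warehouse/database.db/test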
Thanks
Created 09-15-2017 05:11 AM
I'm having the same problem under Cloudera CDH 5.12.1. I was previously using CDH 5.10.1 and upgraded in the hope that the error would be resolved, but it persists in the latest version of CDH.
I have filed a bug in the Apache Spark bug tracker describing the problem and a workaround (manually specifying the path and saving as an external table, or manually updating the Hive metastore data) here: https://issues.apache.org/jira/browse/SPARK-21994
The problem seems to be that Spark does not write the path to the Hive metastore. Any advice on how to fix this?
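The workaround from the JIRA looks roughly like this (a sketch; db_name, table_name, and the HDFS path below are placeholders, not values from the actual report):

// Pass an explicit per-table path so the location gets recorded in the
// metastore; Spark then registers the table at that path.
df.write
  .option("path", "/user/myuser/warehouse/db_name.db/table_name") // placeholder path
  .saveAsTable("db_name.table_name")

// Alternatively, fix up an already-written table by pointing the metastore
// at the existing files (run in Hive or beeline):
//   ALTER TABLE db_name.table_name SET LOCATION
//     'hdfs:///user/hive/warehouse/db_name.db/table_name';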
Created 09-16-2017 04:36 AM
Hi,
Sorry, I forgot to come back here and share the quick workaround I found.
So, here's how I do it:
import org.apache.spark.sql.DataFrameWriter

// For me, every database has a different warehouse. I am not using the
// default warehouse; I use the users' directory for warehousing DBs and tables.
val options = Map("path" -> "this is the path to your warehouse")

// ...and simply write it!
df.write.options(options).saveAsTable("db_name.table_name")
As you can see, simply providing the path to the database's warehouse solves the problem. I would say Spark 2 is not aware of this metadata, yet when you look at spark.catalog you can see everything is there! So I don't know why it cannot figure out the path to your database when you call write.saveAsTable.
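You can see this for yourself through the catalog API (a sketch; db_name and table_name are placeholders):

// The catalog happily lists the table and its columns even though a plain
// read of the table returns no rows:
spark.catalog.listTables("db_name").show(false)
spark.catalog.listColumns("db_name", "table_name").show(false)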
Hope this helps 🙂
Created 03-06-2018 09:26 AM
This workaround has a severe problem.
val options = Map("path" -> "this is the path to your warehouse")
Do NOT do this. When you specify "path" as just the warehouse location, Spark will assume that is the location to purge during an overwrite. This can wipe everything in your warehouse: if you put "/user/hive/warehouse", it will delete everything under "/user/hive/warehouse". This is dangerous and should not have been marked as the accepted answer.
I think the only reason maziyar didn't have everything wiped is that he is using separate warehouses for each database, or he is actually specifying the full path to each table.
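If you do rely on this workaround, a safer variant is to scope "path" to the table's own directory rather than the warehouse root (a sketch; all paths and names below are placeholders):

// Scope "path" to the table's directory, never the warehouse root: on
// overwrite, Spark purges whatever "path" points at, so a per-table path
// keeps the rest of the warehouse safe.
val warehouse = "/user/hive/warehouse" // placeholder: your warehouse root
val db = "db_name"
val table = "table_name"

df.write
  .mode("overwrite")
  .option("path", s"$warehouse/$db.db/$table") // full per-table path
  .saveAsTable(s"$db.$table")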
Created 03-06-2018 03:11 PM
This is a good point to keep in mind when choosing this method, since it was the only workaround for saving tables into Hive on older versions of Cloudera.
NOTE: This problem only happened in Cloudera's distribution and has been fixed in newer Hive/Spark versions. As of CDH 5.14.0 it is completely resolved.
Created 06-22-2018 01:17 AM
Hi,
I still get the error in CDH 5.14 with spark2-shell.