Contributor
Posts: 52
Registered: ‎10-19-2016

save spark 2 dataframe into hive managed table


Hi, I'm using SPARK2-2.0.0.cloudera.beta2-1.cdh5.7.0.p0.110234.

 

I'm trying to save a Spark DataFrame into a Hive table:

 

df.write.mode(SaveMode.Overwrite).partitionBy("date").saveAsTable(s"$databaseName.$tableName")

 

I can list the table in the Beeline shell. However, I cannot read its contents, because the table schema is not what I expected:

 

+-----------+----------------+--------------------+--+
| col_name  | data_type      | comment            |
+-----------+----------------+--------------------+--+
| col       | array<string>  | from deserializer  |
+-----------+----------------+--------------------+--+

 

I've also tried Spark 1.6 (spark1.6.0-cdh5.9.0-hadoop2.6.0), but got the same result.
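For reference, a quick sketch (run from the Spark shell, using the same placeholder names as above) to dump how the table was actually registered in the metastore. On these Spark versions this should show whether the real schema is hiding in the table properties (spark.sql.sources.schema.*) rather than in the metastore columns:

// Sketch: dump the table definition Spark registered in the metastore.
// ($databaseName / $tableName are the same placeholders as above.)
spark.sql(s"SHOW CREATE TABLE $databaseName.$tableName").show(1, false)
spark.sql(s"DESCRIBE EXTENDED $databaseName.$tableName").show(100, false)

The same DESCRIBE FORMATTED / SHOW CREATE TABLE statements can also be run from Beeline.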

 

=== update 2016-11-29 14:50 ===

 

I realized that saveAsTable writes the table in a Spark SQL-specific format, which is NOT compatible with Hive. So I changed to:

 

  • [spark 1.6.0] df.write.mode(SaveMode.Overwrite).partitionBy("date").insertInto(s"$databaseName.$tableName")

However, every time I run a query through Beeline, the Hive metastore server crashes. If I query through Impala, the metastore server works fine.
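In case it helps anyone following along, here is a rough sketch of the fully Hive-compatible pattern, with an assumed two-column schema and STORED AS PARQUET purely for illustration: create the table with plain Hive DDL first, then let insertInto append into it. Note that insertInto matches columns by position, not by name.

import org.apache.spark.sql.SaveMode

// Sketch (assumed schema): pre-create a genuine Hive table so that Hive,
// Beeline and Impala all see real columns instead of a placeholder.
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $databaseName.$tableName (id BIGINT, value STRING)
  PARTITIONED BY (`date` STRING)
  STORED AS PARQUET
""")

// Writing into a partitioned table with insertInto needs dynamic partitioning.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// insertInto is position-based: order the DataFrame columns to match the
// table, partition column last. In Spark 2.x, partitionBy() cannot be
// combined with insertInto(); the table's own partitioning is used.
df.select("id", "value", "date")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(s"$databaseName.$tableName")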

 

  • [spark 2.0.0]  df.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName")

The write operation succeeds some of the time. The table can then be queried from Impala but not from Beeline.

Other times, the write operation fails with the error ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! and the metastore server crashes.
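As far as I can tell, the KeyProviderCache line is an HDFS client message that shows up whenever no KMS key provider (dfs.encryption.key.provider.uri) is configured, so it may be noise rather than the real cause. To separate the Spark write from the Beeline/metastore problem, here is a sketch of what I check from the Spark side:

// Sketch: verify the written table without going through Beeline. If these
// succeed while Beeline still takes the metastore down, the data is intact
// and the problem is on the HiveServer2/metastore side.
val written = spark.table(s"$databaseName.$tableName")
written.printSchema()
println(written.count())
// Only meaningful if the table is partitioned:
spark.sql(s"SHOW PARTITIONS $databaseName.$tableName").show(100, false)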

 

Thanks.

Cloudera Employee
Posts: 20
Registered: ‎01-17-2017

Re: save spark 2 dataframe into hive managed table

# These imports match the top of pyspark/sql/dataframe.py in the Spark 2.x source.
from pyspark import copy_func, since
from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
from pyspark.serializers import BatchedSerializer, PickleSerializer, UTF8Deserializer
from pyspark.storagelevel import StorageLevel
from pyspark.traceback_utils import SCCallSiteSync
from pyspark.sql.types import _parse_datatype_json_string
from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
from pyspark.sql.readwriter import DataFrameWriter
from pyspark.sql.streaming import DataStreamWriter
from pyspark.sql.types import *

Explorer
Posts: 19
Registered: ‎11-04-2016

Re: save spark 2 dataframe into hive managed table

Hello,

 

I have the same problem. I can see a table that was written by Spark 2.2, but I can't read it back! It returns empty rows with no error. However, if the Hive table was saved by Spark 1.6, there is no problem reading it in Spark 2.2!
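No resolution on my side yet either, but here is a sketch of how I've been trying to narrow it down (the table name and path are placeholders): find where the metastore thinks the data lives, then read the files directly, bypassing the table definition.

// Sketch with placeholder names: locate the table's storage directory.
spark.sql("DESCRIBE FORMATTED mydb.mytable").show(100, false)

// Read the files directly, using the Location value from the output above.
// If this returns rows while spark.table("mydb.mytable") comes back empty,
// the metastore entry rather than the data is the suspect.
spark.read.parquet("hdfs:///user/hive/warehouse/mydb.db/mytable").show(5)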

 

I was curious to see how you resolved your issue.

Thanks
