
save spark 2 dataframe into hive managed table

Rising Star

Hi, I'm using SPARK2-2.0.0.cloudera.beta2-1.cdh5.7.0.p0.110234.

 

I'm trying to save a Spark dataframe into a Hive table:

 

df.write.mode(SaveMode.Overwrite).partitionBy("date").saveAsTable(s"$databaseName.$tableName")

 

I can list the table in the beeline shell. However, I cannot read its content, because the table schema is not what I expected:

 

+-----------+----------------+--------------------+--+
| col_name  | data_type      | comment            |
+-----------+----------------+--------------------+--+
| col       | array<string>  | from deserializer  |
+-----------+----------------+--------------------+--+

 

I've tried spark1.6.0-cdh5.9.0-hadoop2.6.0, but got the same result.
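That placeholder schema is what Hive shows for a table whose real schema is stored only in Spark's own table properties (a Spark SQL data source table). A minimal way to confirm this, assuming the same SparkSession `spark` and the $databaseName/$tableName variables used above:

// Diagnostic sketch (Scala, spark-shell). Assumes the SparkSession `spark`
// and the databaseName/tableName variables from the code above.
spark.sql(s"DESCRIBE FORMATTED $databaseName.$tableName").show(100, false)

// If the table parameters include a "spark.sql.sources.provider" entry,
// the table is a Spark SQL data source table: the real schema lives in
// spark.sql.* table properties that Hive cannot interpret, which is why
// beeline only shows the col / array<string> placeholder row.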

 

=== update 2016-11-29 14:50 ===

 

I realized that saveAsTable writes the data in a Spark SQL specific format, which is NOT compatible with Hive. So, I changed to:

 

  • [spark 1.6.0] df.write.mode(SaveMode.Overwrite).partitionBy("date").insertInto(s"$databaseName.$tableName")

However, every time I run a query through beeline, the Hive metastore server crashes. If I query through Impala, the metastore server works fine.

 

  • [spark 2.0.0]  df.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName")

The write operation succeeds some of the time; the table can then be queried by Impala, but not by beeline.

Sometimes the write operation fails with the error "ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!" and the metastore server crashes.
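For reference, a commonly suggested shape for this workaround (a sketch only, assuming a Parquet table partitioned by date, a SparkSession built with enableHiveSupport(), and made-up column names) is to create the table with Hive DDL first, so the metastore holds a genuine Hive schema, and only then load it with insertInto:

import org.apache.spark.sql.SaveMode

// 1. Create the table with Hive DDL so the metastore stores a real
//    Hive-compatible schema. The id/value columns are hypothetical;
//    they must match your dataframe.
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $databaseName.$tableName (
    id BIGINT,
    value STRING
  )
  PARTITIONED BY (`date` STRING)
  STORED AS PARQUET
""")

// 2. Dynamic partition settings are needed for a partitioned insert.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// 3. insertInto matches columns by POSITION, not by name, so the
//    dataframe's columns must line up with the DDL, partition column last.
df.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName")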

 

Thanks.

2 REPLIES

Re: save spark 2 dataframe into hive managed table

Cloudera Employee

# (This looks like the import block from the top of pyspark/sql/dataframe.py
# in the PySpark source, pasted here for reference.)
from pyspark import copy_func, since
from pyspark.rdd import RDD, _load_from_socket, ignore_unicode_prefix
from pyspark.serializers import BatchedSerializer, PickleSerializer, UTF8Deserializer
from pyspark.storagelevel import StorageLevel
from pyspark.traceback_utils import SCCallSiteSync
from pyspark.sql.types import _parse_datatype_json_string
from pyspark.sql.column import Column, _to_seq, _to_list, _to_java_column
from pyspark.sql.readwriter import DataFrameWriter
from pyspark.sql.streaming import DataStreamWriter
from pyspark.sql.types import *

Re: save spark 2 dataframe into hive managed table

Expert Contributor

Hello,

 

I have the same problem. I can see the table that was written by Spark 2.2, but I can't read it back! It returns empty rows with no error. However, if the Hive table was saved by Spark 1.6, there is no problem reading it in Spark 2.2!
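One thing that may be worth checking (a guess, not a confirmed diagnosis): if the table is partitioned, empty results with no error can mean the data files exist on HDFS but the partitions were never registered in the metastore. A quick check from the Spark shell, with mydb.mytable standing in for your own database and table names:

// A guess to check, not a confirmed fix (Scala, spark-shell).
// Assumes the table is partitioned; mydb.mytable is a placeholder.
spark.sql("SHOW PARTITIONS mydb.mytable").show(false)

// If the partition list is empty while data files exist under the table
// location, registering the partitions may make the rows visible again:
spark.sql("MSCK REPAIR TABLE mydb.mytable")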

 

I was curious to see how you resolved your issue.

Thanks
