
[CDH 5.7] Can't create table in Hive compatible way from Spark DataFrame with sqlContext

Rising Star

Hi

 

I have CDH 5.7 with Kerberos, Sentry, Hive and Spark.

I tried to create a table in Hive from a DataFrame in Spark. The table was created, but nothing except sqlContext can read it back.

During creation I get this WARNING:

scala> val df = sqlContext.sql("SELECT * FROM myschema.mytab")
df: org.apache.spark.sql.DataFrame = [browserid: int, browser: string]

scala> df.write.format("parquet").saveAsTable("myschema.mytab_v2")
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hive-exec-1.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hive-jdbc-1.1.0-cdh5.7.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-format-2.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-hadoop-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-pig-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
16/12/12 16:40:28 WARN hive.HiveContext$$anon$2: Could not persist `myschema`.`mytab_v2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User isegrim does not have privileges for CREATETABLE)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:759)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:716)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:415)

...

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:User isegrim does not have privileges for CREATETABLE)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29992)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29960)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:29886)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)

...

 

The table is written to the Hive metastore and can be read back via the Spark sqlContext:

scala> sqlContext.sql("select count(1) from myschema.mytable_v2").show(200, false)
+---+
|_c0|
+---+
|107|
+---+

 

But it cannot be read from Beeline or Hue:

0: jdbc:hive2://myhs2:10> select count(1) from myschema.mytable_v2;
...
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)



The table schema has some Spark-specific properties:

SHOW CREATE TABLE myschema.mytable_v2;
| CREATE TABLE `mytable_v2`(
| `col` array<string> COMMENT 'from deserializer')
| ROW FORMAT SERDE
| 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'
| WITH SERDEPROPERTIES (
| 'path'='hdfs://myns/user/hive/warehouse/myschema.db/mytable_v2')
| STORED AS INPUTFORMAT
| 'org.apache.hadoop.mapred.SequenceFileInputFormat'
| OUTPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
| LOCATION
| 'hdfs://myns/user/hive/warehouse/myschema.db/mytable_v2'
| TBLPROPERTIES (
| 'COLUMN_STATS_ACCURATE'='false',
| 'EXTERNAL'='FALSE',
| 'numFiles'='2',
| 'numRows'='-1',
| 'rawDataSize'='-1',
| 'spark.sql.sources.provider'='parquet',
| 'spark.sql.sources.schema.numParts'='1',
| 'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"myfield\"}]}',
| 'totalSize'='2300',
| 'transient_lastDdlTime'='1481557228')

Even when I set Spark as the execution engine for Hive, it doesn't work:

0: jdbc:hive2://myhs2:10> set hive.execution.engine=spark;

0: jdbc:hive2://myhs2:10> select count(1) from myschema.mytable_v2;

...

Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2016-12-12 18:00:43,189 Stage-0_0: 0(+2)/2 Stage-1_0: 0/1
INFO : 2016-12-12 18:00:45,205 Stage-0_0: 0(+2,-4)/2 Stage-1_0: 0/1
ERROR : Status: Failed
ERROR : FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
INFO : Completed executing command(queryId=hive_20161212180000_4d80266f-15f5-4fe2-b044-c9324fd75ba6); Time taken: 25.74 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)

 

User isegrim is in the LDAP group mygroup:

# id isegrim
uid=1001(isegrim) gid=501(mygroup) groups=501(mygroup)

 

Sentry has a role mygroup, to which the group mygroup is attached:

0: jdbc:hive2://myhs2:10> show role grant group mygroup;

INFO : OK
+-------+---------------+-------------+----------+--+
| role | grant_option | grant_time | grantor |
+-------+---------------+-------------+----------+--+
| mygroup | false | NULL | -- |
+-------+---------------+-------------+----------+--+

 

and this role mygroup has been granted the ALL privilege on myschema:

0: jdbc:hive2://myhs2:10> show grant role mygroup;

INFO : OK
+-------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| database | table | partition | column | principal_name | principal_type | privilege | grant_option | grant_time | grantor |
+-------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| default | | | | mygroup | ROLE | select | false | 1473364008245000 | -- |
| myschema | | | | mygroup | ROLE | * | false | 1473364207568000 | -- |
+-------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+

 

I suppose everything would be fine if (as the warning says) I could persist `myschema`.`mytable_v2` in a Hive-compatible way, instead of persisting it into the Hive metastore in a Spark SQL specific format.

 

What should I do to store tables from a Spark DataFrame in a Hive-compatible way?
And why does Sentry deny CREATETABLE for the Hive-compatible path, yet allow it for the Spark SQL specific format?

 

1 ACCEPTED SOLUTION

Rising Star

As usual, the solution was somewhere out there in the depths of the Internet 😉

 

http://stackoverflow.com/questions/37393017/convert-dataframe-to-hive-table-in-spark-scala

 

http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html

 

The proposed workaround worked like a charm for me.

Thank you!

 

Workaround: Explicitly create a Hive table to store the data. For example:

df.registerTempTable(tempName)
hsc.sql(s"""
  CREATE TABLE $tableName (
    -- field definitions go here (this text ends up inside the SQL string, so use SQL comment syntax)
  )
  STORED AS $format""")
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")

 


3 REPLIES

Explorer

I have exactly the same problem, and the environment is also the same.

But I found that if you perform the procedure as a user who has the ALL privilege on the Hive server object in Sentry (e.g. "server1"), it completes successfully, and you can then read the table via both Hive and Spark.

The log says "message:User xx does not have privileges for CREATETABLE", so I think it has something to do with Sentry. But I'm stuck; does anyone have a clue?

Contributor

With a `SparkSession`, applications can create DataFrames from an existing `RDD`, from a Hive table, or from Spark data sources. As an example, the Spark documentation builds a DataFrame from the content of a JSON file (the full listing is the create_df example in scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark source tree).
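
A minimal sketch of that documented example (note this is the Spark 2.x API, not the Spark 1.6 sqlContext used elsewhere in this thread; the path points at the sample file that ships with the Spark distribution, so substitute your own JSON file):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; in spark-shell one already exists as `spark`.
val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .getOrCreate()

// Create a DataFrame from a JSON file and show its contents.
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()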