Created on 12-12-2016 09:36 AM - edited 09-16-2022 03:50 AM
Hi
I have CDH 5.7 and Kerberos, Sentry, Hive and Spark.
I've tried to create a table in Hive from a DataFrame in Spark. The table was created, but nothing except sqlContext can read it back.
During creation I get this WARNING:
scala> val df = sqlContext.sql("SELECT * FROM myschema.mytab")
df: org.apache.spark.sql.DataFrame = [browserid: int, browser: string]

scala> df.write.format("parquet").saveAsTable("myschema.mytab_v2")
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hive-exec-1.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hive-jdbc-1.1.0-cdh5.7.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-format-2.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-hadoop-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/parquet-pig-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
16/12/12 16:40:28 WARN hive.HiveContext$$anon$2: Could not persist `myschema`.`mytab_v2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User isegrim does not have privileges for CREATETABLE)
  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:759)
  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:716)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:415)
  ...
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:User isegrim does not have privileges for CREATETABLE)
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29992)
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29960)
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:29886)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
  ...
The table is written to the Hive metastore and can be read back by Spark via sqlContext:
scala> sqlContext.sql("select count(1) from myschema.mytable_v2").show(200, false) +---+ |_c0| +---+ |107| +---+
But it cannot be read from Beeline or Hue:
0: jdbc:hive2://myhs2:10> select count(1) from myschema.mytable_v2;
...
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
The table schema contains some Spark-specific properties:
SHOW CREATE TABLE myschema.mytable_v2;
| CREATE TABLE `mytable_v2`(
|   `col` array<string> COMMENT 'from deserializer')
| ROW FORMAT SERDE
|   'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'
| WITH SERDEPROPERTIES (
|   'path'='hdfs://myns/user/hive/warehouse/myschema.db/mytable_v2')
| STORED AS INPUTFORMAT
|   'org.apache.hadoop.mapred.SequenceFileInputFormat'
| OUTPUTFORMAT
|   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
| LOCATION
|   'hdfs://myns/user/hive/warehouse/myschema.db/mytable_v2'
| TBLPROPERTIES (
|   'COLUMN_STATS_ACCURATE'='false',
|   'EXTERNAL'='FALSE',
|   'numFiles'='2',
|   'numRows'='-1',
|   'rawDataSize'='-1',
|   'spark.sql.sources.provider'='parquet',
|   'spark.sql.sources.schema.numParts'='1',
|   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"myfield\"}]}',
|   'totalSize'='2300',
|   'transient_lastDdlTime'='1481557228')
Even when I set Spark as the execution engine for Hive, it doesn't work:
0: jdbc:hive2://myhs2:10> set hive.execution.engine=spark;
0: jdbc:hive2://myhs2:10> select count(1) from myschema.mytable_v2;
...
Status: Running (Hive on Spark job[0])
INFO  : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO  : 2016-12-12 18:00:43,189 Stage-0_0: 0(+2)/2    Stage-1_0: 0/1
INFO  : 2016-12-12 18:00:45,205 Stage-0_0: 0(+2,-4)/2 Stage-1_0: 0/1
ERROR : Status: Failed
ERROR : FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
INFO  : Completed executing command(queryId=hive_20161212180000_4d80266f-15f5-4fe2-b044-c9324fd75ba6); Time taken: 25.74 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)
User isegrim is in the LDAP group mygroup:
# id isegrim
uid=1001(isegrim) gid=501(mygroup) groups=501(mygroup)
Sentry has a role mygroup, to which the group mygroup is attached:
0: jdbc:hive2://myhs2:10> show role grant group mygroup;
INFO  : OK
+----------+---------------+-------------+----------+--+
|   role   | grant_option  | grant_time  | grantor  |
+----------+---------------+-------------+----------+--+
| mygroup  | false         | NULL        | --       |
+----------+---------------+-------------+----------+--+
and this role mygroup has been granted the ALL privilege on myschema:
0: jdbc:hive2://myhs2:10> show grant role mygroup;
INFO  : OK
+------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
|  database  | table  | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  |    grant_time     | grantor  |
+------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| default    |        |            |         | mygroup         | ROLE            | select     | false         | 1473364008245000  | --       |
| myschema   |        |            |         | mygroup         | ROLE            | *          | false         | 1473364207568000  | --       |
+------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
I suppose everything would be fine if, as the WARNING says, Spark could persist `myschema`.`mytable_v2` in a Hive-compatible way, instead of persisting it into the Hive metastore in a Spark SQL specific format.
What can I do to store tables from a Spark DataFrame in a Hive-compatible way?
And why does Sentry deny CREATETABLE for the Hive-compatible path, but allow it in the Spark SQL specific format?
Created on 12-12-2016 02:30 PM - edited 12-12-2016 02:31 PM
As usual, the solution was somewhere out there in the depths of the Internet 😉
http://stackoverflow.com/questions/37393017/convert-dataframe-to-hive-table-in-spark-scala
http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
The proposed workaround worked like a charm for me.
Thank you!
Workaround: Explicitly create a Hive table to store the data. For example:
df.registerTempTable(tempName)
hsc.sql(s"""
  CREATE TABLE $tableName (
    // field definitions
  ) STORED AS $format
""")
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
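Applied to the table from the question, it might look like this (a minimal sketch only: the column names and types are taken from the DataFrame schema printed in the question, and `mytab_v2_tmp` is a made-up temporary table name):

// Register the DataFrame under a temporary name, then create the target table
// with explicit Hive DDL so the metastore entry is Hive-compatible, and fill it.
val df = sqlContext.sql("SELECT * FROM myschema.mytab")
df.registerTempTable("mytab_v2_tmp")
sqlContext.sql("""
  CREATE TABLE myschema.mytab_v2 (
    browserid INT,
    browser STRING
  ) STORED AS PARQUET
""")
sqlContext.sql("INSERT INTO TABLE myschema.mytab_v2 SELECT * FROM mytab_v2_tmp")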
Created 01-09-2017 10:41 PM
I have exactly the same problem, and the env is also the same.
But I found that if you perform the procedure as a user who has the ALL privilege on the Hive server (e.g. on "server1"), it completes successfully, and you can then read the table from both Hive and Spark.
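For reference, the kind of server-wide grant that distinguishes such a user would look roughly like this in Beeline (a sketch only; `admin_role` and the group `admins` are hypothetical names, and `server1` is Sentry's default server name):

0: jdbc:hive2://myhs2:10> CREATE ROLE admin_role;
0: jdbc:hive2://myhs2:10> GRANT ALL ON SERVER server1 TO ROLE admin_role;
0: jdbc:hive2://myhs2:10> GRANT ROLE admin_role TO GROUP admins;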
The log says "message:User xx does not have privileges for CREATETABLE", so I think it might have something to do with Sentry. But I'm stuck. Does anyone have a clue?
Created 01-17-2017 05:36 PM
With a `SparkSession`, applications can create DataFrames from an existing `RDD`, from a Hive table, or from Spark data sources. As an example, the following creates a DataFrame based on the content of a JSON file:
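The example being referred to, reconstructed from the Spark SQL programming guide (the `people.json` path is the sample file shipped in the Spark examples directory), looks roughly like:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, the Spark 2.x entry point
val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .getOrCreate()

// Create a DataFrame from the content of a JSON file
val df = spark.read.json("examples/src/main/resources/people.json")

// Display the content of the DataFrame to stdout
df.show()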