
User class threw exception: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.io.IOException: Unable to create directory /tmp/hive/

Explorer

Hi community,

 

We run Spark 2.3.2 on Hadoop 3.1.1.

 

We use external ORC tables stored on HDFS.

 

We are encountering an issue in a job run under cron when issuing the command `sql("msck repair table db.some_table")`. The table is partitioned, and the error is the following:

 

21/03/22 22:44:13 WARN HiveConf: HiveConf of name hive.heapsize does not exist
21/03/22 22:44:13 WARN HiveConf: HiveConf of name hive.stats.fetch.partition.stats does not exist
21/03/22 22:44:13 WARN HiveConf: HiveConf of name hive.plan.serialization.format does not exist
Hive Session ID = 2625af79-e021-4b57-9435-e0fea4f00803
21/03/22 22:44:13 INFO SessionState: Hive Session ID = 2625af79-e021-4b57-9435-e0fea4f00803
21/03/22 22:44:13 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.io.IOException: Unable to create directory /tmp/hive/2625af79-e021-4b57-9435-e0fea4f00803_resources;
org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.io.IOException: Unable to create directory /tmp/hive/2625af79-e021-4b57-9435-e0fea4f00803_resources;
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
        at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
        at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
        at org.apache.spark.sql.hive.HiveSessionStateBuilder.org$apache$spark$sql$hive$HiveSessionStateBuilder$$externalCatalog(HiveSessionStateBuilder.scala:39)
        at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:53)
        at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:53)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:90)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:90)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:237)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:176)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:400)
        at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.getMetaData(ExternalCatalogUtils.scala:265)
        at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfRO(ExternalCatalogUtils.scala:310)
        at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:117)
        at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
        at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:85)
        at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:83)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103)
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)

 

The same code runs without any error in another environment, and none of the other flows using this command have issues with it. As a side effect, the table that was populated before issuing the partition repair also seems to produce duplicate entries for each new record.

 

I'm not sure whether it's a permissions problem, in case the command needs temporary files to store e.g. metastore information; that would be quite unusual, though, as none of the other flows have ever had trouble with the same command.

 

Might it be a problem with dependencies? HBase is involved initially to read from some sources, and we avoid using LLAP.

 

The code looks like:

 

df.write
  .format("orc")
  .mode("append")
  .partitionBy(singleColumn)
  .option("compression", "snappy")
  .save(hdfsPath)

sql(s"msck repair table $tableOfInterest") // $tableOfInterest = db.some_table

 

Thanks a lot in advance!

 

Cheers

1 ACCEPTED SOLUTION

Explorer

Ok, we found the very stupid issue.

 

This specific job, running standalone, was passing hive-site.xml as a file to spark-submit, whereas all the other jobs run under Oozie and use a generic spark-submit that doesn't pass hive-site.xml. That file specifies /tmp/hive as the default directory for dumping temporary resources, and it turned out that our user still has issues with that folder, issues that are being investigated. The workaround so far is to not pass the hive-site.xml file, so the default directory is /tmp instead, where we can happily live without issues.
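Roughly, the difference between the two submits looks like this (paths and class names below are placeholders, not our real ones):

# the failing standalone job shipped our hive-site.xml
spark-submit --files /path/to/hive-site.xml --class com.example.OurJob our-job.jar

# workaround: omit it, so temporary resources default to /tmp
spark-submit --class com.example.OurJob our-job.jar

An alternative we haven't tried would be to keep hive-site.xml but override only the resources directory, e.g. via --conf spark.hadoop.hive.downloaded.resources.dir=... (hive.downloaded.resources.dir is the Hive property behind the *_resources directories).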

 

All in all, it was a stupid "mistake" that made us aware of other issues with our current system.

 

Cheers and thanks to all for the support!


6 REPLIES

Master Collaborator

Hi @zampJeri 

 

Could you please let me know which user you are running the Spark application as? Check that this user has permission to create files/directories under the /tmp/hive directory.
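For example, something along these lines on the host where the driver/ApplicationMaster runs (myUser below is a placeholder for your job's user):

# which user does the job run as, and what groups does it have?
id myUser

# local file system permissions on /tmp/hive
ls -ld /tmp/hive

# can that user actually create a directory there?
sudo -u myUser mkdir /tmp/hive/perm_test && rmdir /tmp/hive/perm_test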

Explorer

Hi @RangaReddy ,

 

Thanks for the reply.

 

If I do a simple

 

hdfs dfs -ls /tmp/hive

 

I see:

 

ls: Permission denied: user={myUser}  access=READ_EXECUTE, inode="/tmp/hive":hive:hdfs:drwx-wx-wx

 

I guess that msck repair is using that folder to store temporary files. Is it because the spark-submit passes

 

--conf spark.datasource.hive.warehouse.load.staging.dir="/tmp"

?

 

Thanks


@zampJeri  This /tmp is on the local OS file system, not HDFS. It wants to create the _resources directory there and is unable to. Does the user have permissions on /tmp/hive?
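In other words, what fails is roughly the equivalent of running this on the local disk of that host (session ID taken from your log above):

mkdir /tmp/hive/2625af79-e021-4b57-9435-e0fea4f00803_resources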

Master Collaborator

@zampJeri 

 

Yes, either the write operation or the msck repair command is using that temp directory, and the current user does not have permission to create directories in it. Could you please grant the proper permission and re-run the job?
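For instance, assuming it is the local /tmp/hive directory and a world-writable scratch dir is acceptable in your environment (adjust the mode to your security policy):

# local FS on the affected host; the sticky bit stops users deleting each other's files
sudo chmod 1777 /tmp/hive

# if it turns out to be the HDFS path instead, Hive's usual scratch-dir mode is 733
hdfs dfs -chmod 733 /tmp/hive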

Explorer

@RangaReddy @mugdha 

 

Hi, thanks for the replies.

 

The user has all the permissions needed to write to /tmp and its subfolders.

 

We are currently investigating other parts of the code, even though the exception points to the specific line with the msck repair command. As far as I knew, that command throws an exception when run against non-partitioned tables, but the table in question is indeed partitioned. I'm not sure whether an empty table could cause trouble, but then other jobs should occasionally break in the same way (especially the same code in a different environment, which should behave identically given the authentication files passed to the submit).
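For what it's worth, the partitioning itself is easy to double-check, since `show partitions` fails outright on a non-partitioned table:

sql("show partitions db.some_table").show()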

 

In the beginning, we were using the Hive Warehouse Connector by means of

hive.execute("msck repair table etc...")

but we were told to stay away from triggering LLAP unnecessarily (it was generally giving us a lot of trouble), so we removed all uses of HWC, and all jobs run just fine with spark.sql.

 

Cheers!
