Explorer
Posts: 9
Registered: 11-23-2015

spark.sql.parquet.cacheMetadata does not work for the Spark HiveThriftServer

We use Spark SQL (as a Spark application) and other ETL tools to regenerate the data of Hive tables in Parquet format, and we run a Spark HiveThriftServer to query those tables. Sometimes we find that we cannot query some of the tables; the error log looks like this:

Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00186-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00187-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00188-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00189-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00190-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00191-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00192-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00193-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00194-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00195-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00196-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
Input path does not exist: hdfs://bigdatacluster1/hiveweb/recommend.db/user2brandidtable/part-r-00197-a165f573-8454-4df1-b37b-a8cdd414f2de.gz.parquet
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
    at org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:339)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1$$anon$4.listStatus(ParquetRelation.scala:358)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
    at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:294)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1.getPartitions(ParquetRelation.scala:363)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$collect$1.apply(DataFrame.scala:1504)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$collect$1.apply(DataFrame.scala:1504)
    at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1504)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1481)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:226)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

So we changed spark.sql.parquet.cacheMetadata to false and restarted the Spark HiveThriftServer, but that does not help; the only thing that works is executing "REFRESH TABLE xxx" manually. Does anyone know what causes this issue?
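For reference, here is roughly what we do today (a sketch of our setup; the start-thriftserver.sh invocation is how we launch the server, and the database/table name recommend.user2brandidtable is inferred from the paths in the error above, so it may differ in other environments):

    # restart the Thrift server with Parquet metadata caching disabled
    $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.sql.parquet.cacheMetadata=false

    -- the setting above does not avoid the error; the only thing that clears it
    -- is refreshing the affected table manually from beeline before querying it:
    REFRESH TABLE recommend.user2brandidtable;
    SELECT * FROM recommend.user2brandidtable LIMIT 10;

Having to run REFRESH TABLE by hand after every regeneration of the table is not practical for us, so we were hoping that disabling the metadata cache would make the Thrift server pick up the new files automatically.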