
Count error on newly created partition data: org.apache.spark.sql.catalyst.errors.package

Hey,

 

I am running a PySpark program that creates and loads partitions on a target table (Parquet) by reading from an external table. Up to that point everything works fine, but when I add a step that counts the records in the newly created partition, it fails with the error below.
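For context, the job does roughly the following (a sketch only, not runnable outside the cluster; `staging.tablename` and the `as_of_date` partition column come from the logs, while the source table name and the select list are placeholders):

```python
# Sketch of the flow described above, assuming sqlContext is a HiveContext.
# Only staging.tablename and as_of_date appear in the logs; the external
# source table name and columns are hypothetical placeholders.
sqlContext.sql("""
    INSERT OVERWRITE TABLE staging.tablename
    PARTITION (as_of_date='2018-10-08')
    SELECT * FROM external_source_table WHERE load_date='2018-10-08'
""")

# The count step that then fails with the TreeNodeException shown below
cnt = sqlContext.sql("""
    SELECT COUNT(1) cnt_tgt FROM staging.tablename
    WHERE as_of_date='2018-10-08'
""").collect()[0][0]
```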

 

To set the context: I checked the partition hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08 and it has 9 split files, i.e. part-00000 to part-00008, but the org.apache.spark.sql.catalyst.errors.package$TreeNodeException shows Spark searching for part-00009. Not sure why.
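One plausible explanation (an assumption, not confirmed by the logs alone): Spark caches the file listing for a Parquet table when it is first resolved, and the ParquetRelation line in the log below actually lists 14 input files (part-00000 through part-00013) for this partition even though HDFS now holds only 9. If the partition was rewritten with fewer part files after the listing was cached, the scan asks HDFS for files that no longer exist. A plain-Python illustration of the mismatch (no Spark required):

```python
# Hypothetical illustration: the file listing Spark cached vs. the files
# actually present on HDFS after the partition was rewritten.
cached_listing = ["part-%05d" % i for i in range(14)]  # 14 files in the log's ParquetRelation line
actual_files = ["part-%05d" % i for i in range(9)]     # 9 files seen on HDFS

# Files the scan will request that no longer exist
stale = sorted(set(cached_listing) - set(actual_files))
print(stale[0])  # part-00009, the first file the failing scan looks for
```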

 

I have also tried MSCK REPAIR TABLE after inserting into the partition, but no luck.
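MSCK REPAIR TABLE only fixes the Hive metastore's partition list; it does not invalidate Spark's own cached file listing. In Spark 1.6 that cache can be dropped with HiveContext.refreshTable before re-running the count. A sketch (not runnable outside a cluster; assumes sqlContext is a HiveContext):

```python
# Sketch, assuming sqlContext is a HiveContext (Spark 1.6): drop Spark's
# cached metadata and file listing for the table, then retry the count.
sqlContext.refreshTable("staging.tablename")

sqlContext.sql(
    "SELECT COUNT(1) cnt_tgt FROM staging.tablename "
    "WHERE as_of_date='2018-10-08'"
).show()
```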

CDH 5.8.1

Python 2.6.6
Spark 1.6.0
Hive 1.1.0

 

Can anyone help?

 

Below are the logs:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

19/01/02 03:59:52 INFO metadata.Hive: Replacing src:hdfs://TSTSER/staging/tablename/.hive-staging_hive_2019-01-02_03-59-03_589_9209689873195803423-1/-ext-10000/part-00008, dest: hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00008, Status:true
setfacl: Permission denied. user=bdgatetl is not the owner of inode=part-00008
Statement Executing  >> SELECT COUNT(1) cnt_tgt FROM staging.tablename WHERE as_of_date='2018-10-08'
19/01/02 03:59:52 INFO datasources.DataSourceStrategy: Selected 1 partitions out of 12, pruned 91.66666666666666% partitions.
19/01/02 03:59:52 INFO storage.MemoryStore: Block broadcast_50 stored as values in memory (estimated size 300.5 KB, free 2.8 MB)
19/01/02 03:59:52 INFO storage.MemoryStore: Block broadcast_50_piece0 stored as bytes in memory (estimated size 25.0 KB, free 2.8 MB)
19/01/02 03:59:52 INFO storage.BlockManagerInfo: Added broadcast_50_piece0 in memory on 10.61.62.8:43317 (size: 25.0 KB, free: 529.6 MB)
19/01/02 03:59:52 INFO spark.SparkContext: Created broadcast 50 from javaToPython at null:-1
19/01/02 03:59:52 INFO parquet.ParquetRelation: Reading Parquet file(s) from hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00000, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00001, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00002, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00003, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00004, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00005, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00006, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00007, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00008, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00009, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00010, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00011, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00012, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08/part-00013
An error occurred while calling o651.javaToPython.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[cnt_tgt#2737])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#3492L])
      +- Scan ParquetRelation: staging.tablename[] InputPaths: hdfs://TSTSER/staging/tablename/as_of_date=2018-09-17, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-19, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-20, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-25, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-28, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-02, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-04, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-05, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-09, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-31, hdfs://TSTSER/staging/tablename/as_of_date=2018-11-20
 
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
        at org.apache.spark.sql.DataFrame.javaToPython(DataFrame.scala:1733)
        at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#3492L])
   +- Scan ParquetRelation: staging.tablename[] InputPaths: hdfs://TSTSER/staging/tablename/as_of_date=2018-09-17, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-19, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-20, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-25, hdfs://TSTSER/staging/tablename/as_of_date=2018-09-28, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-02, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-04, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-05, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-08, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-09, hdfs://TSTSER/staging/tablename/as_of_date=2018-10-31, hdfs://TSTSER/staging/tablename/as_of_date=2018-11-20
 
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
        at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
        ... 18 more
Caused by: java.io.FileNotFoundException: File does not exist: /staging/tablename/as_of_date=2018-10-08/part-00009