Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Error in querying Hive table created in parquet format

avatar
Expert Contributor

hello - i'm getting error in querying Hive table (data in Parquet format) :

select * from powerpoll_k1 where year=2017 and month=12 and day=11 limit 5;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 102.0 failed 4 times, most recent failure: Lost task 0.3 in stage 102.0 (TID 20602, msc02-jag-dn-016.uat.gdcs.apple.com): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionaryat org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:274)at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)at org.apache.spark.scheduler.Task.run(Task.scala:86)at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)

Table DDL is shown below :

CREATE EXTERNAL TABLE `powerpoll_k1`(`topic_k` varchar(255), `partition_k` int, `offset_k` bigint, `timestamp_k` timestamp, `deviceid` bigint, `devicename` varchar(50), `deviceip` varchar(128), `peerid` int, `objectid` int, `objectname` varchar(256), `objectdesc` varchar(256), `oid` varchar(50), `pduoutlet` varchar(50), `pluginid` int, `pluginname` varchar(255), `indicatorid` int, `indicatorname` varchar(255), `format` int, `snmppollvalue` varchar(128) COMMENT 'value in kafka avsc', `time` double, `clustername` varchar(50) COMMENT 'rpp or power', `peerip` varchar(50))COMMENT 'external table at /apps/hive/warehouse/amp.db/sevone_power'PARTITIONED BY (`year` int, `month` int, `day` int)

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'WITH SERDEPROPERTIES ('serialization.format' = '1')STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'LOCATION 'hdfs://application/apps/hive/warehouse/amp.db/power'TBLPROPERTIES ('transient_lastDdlTime' = '1513022286')

any ideas on what the issue is ?

2 REPLIES 2

avatar
Expert Contributor

@bkosaraju, @mqureshi - any ideas on this ?

avatar
Expert Contributor

fyi .. i re-checked this & the issue seems to be when i include the column - deviceid bigint - in the query