
Spark 1.4.1 On HDP-2.3.2.0-2950 Querying ORC table gives InvalidProtocolBufferException: Message missing required fields


Hi all,

I want to use Spark SQL (PySpark) to process some telematics data stored in an ORC table in Hive.

When I query an external CSV table everything works fine; however, every ORC table gives me an InvalidProtocolBufferException.
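For contrast, a query of this shape against the CSV-backed table returns results with no trouble (csv_table is a placeholder name, not our real table):

-------------------------------------------------

# Same HiveContext as in the snippet below; csv_table is a
# hypothetical stand-in for our external CSV-backed table
csvDF = sqlContext.sql('select count(*) as total from default.csv_table')
print(csvDF.first())

-------------------------------------------------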

Here is the ORC query code that fails:

-------------------------------------------------

# HiveContext lives in pyspark.sql, not in the top-level pyspark package
from pyspark.sql import HiveContext

# sc is the SparkContext that the pyspark shell provides
sqlContext = HiveContext(sc)

sqlStr = r'select count(*) as total from default.orc_table where vehicle_id = 39982241'
vehicleDF = sqlContext.sql(sqlStr)

print(vehicleDF)
print(vehicleDF.first())

-------------------------------------------------
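In case it helps narrow things down, the same files could presumably also be read straight through Spark's ORC data source (available with a HiveContext since Spark 1.4), bypassing the metastore table definition. This is only a sketch, and the warehouse path below is a guess rather than our actual location:

-------------------------------------------------

# Hypothetical direct read of the table's ORC files with the ORC
# data source; the path is a placeholder for the real location
orcDF = sqlContext.read.format("orc").load("/apps/hive/warehouse/orc_table")
orcDF.registerTempTable("orc_direct")
print(sqlContext.sql("select count(*) as total from orc_direct "
                     "where vehicle_id = 39982241").first())

-------------------------------------------------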

This gives the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 16.0 failed 1 times, most recent failure: Lost task 2.0 in stage 16.0 (TID 38, localhost): org.spark_project.protobuf.InvalidProtocolBufferException: Message missing required fields: streams[2].kind, streams[4].kind
	at org.spark_project.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81)
	at org.spark_project.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71)
	at org.spark_project.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
	at org.spark_project.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
	at org.spark_project.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:8878)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2174)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2505)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:2949)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:2991)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:284)
	at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:480)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:214)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.<init>(OrcInputFormat.java:146)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1010)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
Google searches don't turn up much. When I run the same queries on the HDP 2.3 sandbox they execute successfully, so something about this cluster's setup seems to be the trigger. From the trace, the failure happens inside Spark's shaded protobuf (org.spark_project.protobuf) while reading an ORC stripe footer, rather than in Hive's own ORC reader.

Has anyone seen similar behavior and found a way around it? Please let me know.

Derck