Is there any fix or workaround available for the NullPointerException with ORC in Spark?

I have an ORC table in Hive, and I'm using Spark SQL to query it from Spark. The table is partitioned and has two partitions: one has data and the other has none. I understand that there is an existing bug in Spark when it handles a zero-byte file in a Hive table stored as ORC. I just want to know whether there is any workaround for this issue. Upgrading the Spark version is not an option.

1 REPLY

You should be able to use SHOW TABLE EXTENDED with a PARTITION clause to get information about each partition, and then avoid opening any partition that is zero bytes. Like this:

scala> var sqlCmd="show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')"
sqlCmd: String = show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')

scala> var partitionsList=sqlContext.sql(sqlCmd).collectAsList
partitionsList: java.util.List[org.apache.spark.sql.Row] =
[[mydb,mytable,false,Partition Values: [date_time_date=2017-01-01]
Location: hdfs://mycluster/apps/hive/warehouse/mydb.db/mytable/date_time_date=2017-01-01
Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties: [serialization.format=1]
Partition Parameters: {rawDataSize=441433136, numFiles=1, transient_lastDdlTime=1513597358, totalSize=4897483, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=37825}
]]
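
To act on that output, one option is to read totalSize out of the "Partition Parameters" text and skip any partition that reports zero. The sketch below is only an illustration: PartitionStats, param and hasData are made-up names, and it assumes the metastore statistics are present and up to date, which is not guaranteed.

```scala
// Hypothetical helper (not part of Spark or Hive): pulls a numeric value such
// as totalSize or numFiles out of the "Partition Parameters: {...}" text.
object PartitionStats {
  // Returns the value for `key` if the description contains "key=<digits>".
  def param(description: String, key: String): Option[Long] = {
    val pattern = ("\\b" + java.util.regex.Pattern.quote(key) + "=(\\d+)").r
    pattern.findFirstMatchIn(description).map(_.group(1).toLong)
  }

  // Assumption: a partition is worth reading only when totalSize is reported
  // and greater than zero. A partition with no statistics is treated as empty.
  def hasData(description: String): Boolean =
    param(description, "totalSize").exists(_ > 0L)
}

// Possible use in spark-shell, assuming the description text is the fourth
// column (index 3) of each row, as in the output above:
//   val info = sqlContext.sql(sqlCmd).collect().map(_.getString(3))
//   val readable = info.filter(PartitionStats.hasData)
```

Treating a partition without statistics as empty keeps the query away from files that might fail, at the cost of skipping partitions whose statistics were never gathered, so check that behaviour against your own tables.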

Let me know if that works and lets you avoid the zero-byte partitions, or if you still get the NullPointerException.

James