Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

[Hive 3.1.0] Missing lines when requesting ORC external table

avatar

Hi,

we are facing a strange issue with an ORC external Hive table, some lines can't be retrieved.

The context is HDP 3.1.0 with ORC files generated with a Spark 2.3.2 job into an HDFS partition with an external table mapped on this HDFS directory. Table partitions were added.

When trying to filtering this table with a specific column value, there's no result;but when accessing these files from the Spark interpreter of a Zeppelin notebook, the filter shows the expected lines.

Hive LLAP is not used for this issue and have another behaviour (data is retrieved).

 

The SQL query (JDBC tool or beeline command) that don't return any result :

select * from tracabilite.tg1pivot where sscc="330232926251080606" and `application`="LOGUSI" and month="04" and day="01";

Another SQL query that returns 2 results (within the same partition):

select * from tracabilite.tg1pivot where sscc="330232926636794272" and `application`="LOGUSI" and month="04" and day="01";

The Zeppelin notebook sample :

val logusi = spark.read.format("orc").option("header",true).load("/DEV/smart_data/TG/application=LOGUSI/year=2020/month=04/day=01")
println(logusi.count)
logusi.where(col("sscc")==="330232926251080606").show()

Please help to find and fix why some lines are not readable from Hive.

Do you have any tracks to follow, some hive/tez parameters to check or any bugs known ?

 

Thanks

Olivier

1 ACCEPTED SOLUTION

avatar

The cause has been identified; a wrong table property 'skip.header.line.count'='1' inherited from other CSV format tables that makes this strong behaviour on the ORC external table : missing lines, empty counts (due to null values ?), ...

View solution in original post

1 REPLY 1

avatar

The cause has been identified; a wrong table property 'skip.header.line.count'='1' inherited from other CSV format tables that makes this strong behaviour on the ORC external table : missing lines, empty counts (due to null values ?), ...