Created on 04-08-2020 10:49 AM - last edited on 04-08-2020 11:39 AM by cjervis
Hi,
we are facing a strange issue with an ORC external Hive table: some rows can't be retrieved.
The context is HDP 3.1.0, with ORC files generated by a Spark 2.3.2 job into an HDFS partition directory and an external table mapped on this HDFS directory. The table partitions were then added.
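For context, the table definition looks roughly like this (simplified sketch; only sscc, the partition keys and the base location come from the queries and paths below, the other business columns are omitted):
CREATE EXTERNAL TABLE tracabilite.tg1pivot (
  sscc STRING
  -- other business columns omitted
)
PARTITIONED BY (`application` STRING, `year` STRING, `month` STRING, `day` STRING)
STORED AS ORC
LOCATION '/DEV/smart_data/TG';
-- partitions registered after the Spark job, e.g.:
ALTER TABLE tracabilite.tg1pivot ADD IF NOT EXISTS PARTITION (`application`='LOGUSI', `year`='2020', `month`='04', `day`='01');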
When filtering this table on a specific column value there is no result, but when accessing the same files from the Spark interpreter of a Zeppelin notebook, the filter returns the expected rows.
Hive LLAP is not used in this case; with LLAP the behaviour is different (the data is retrieved).
The SQL query (from a JDBC tool or the beeline command) that does not return any result:
select * from tracabilite.tg1pivot where sscc="330232926251080606" and `application`="LOGUSI" and month="04" and day="01";
Another SQL query that returns 2 results (within the same partition):
select * from tracabilite.tg1pivot where sscc="330232926636794272" and `application`="LOGUSI" and month="04" and day="01";
The Zeppelin notebook sample:
import org.apache.spark.sql.functions.col

// read the ORC files of the partition directly (the "header" option only applies to CSV sources and is ignored for ORC, so it is dropped here)
val logusi = spark.read.format("orc").load("/DEV/smart_data/TG/application=LOGUSI/year=2020/month=04/day=01")
println(logusi.count)
// the filter that returns nothing in Hive finds the row here
logusi.where(col("sscc") === "330232926251080606").show()
Please help us find out and fix why some rows are not readable from Hive.
Do you have any leads to follow, Hive/Tez parameters to check, or known bugs?
Thanks
Olivier
Created 04-08-2020 11:47 PM
The cause has been identified: a wrong table property 'skip.header.line.count'='1', inherited from other CSV-format tables, causes this strange behaviour on the ORC external table: missing rows, empty counts (due to null values?), etc.
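For anyone hitting the same symptom, the property can be inspected and removed with standard Hive DDL (shown here on the table from the question):
SHOW TBLPROPERTIES tracabilite.tg1pivot;
-- remove the header-skip property, which only makes sense for text/CSV tables
ALTER TABLE tracabilite.tg1pivot UNSET TBLPROPERTIES ('skip.header.line.count');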