Hello,
I have a simple query against an ORC table that selects a relatively small number of rows from a 10-billion-row table. The query has this form:
select * from <table> where <col>=<value>
On Hive using Tez it runs in a few seconds. However, using Spark SQL it takes about 5 minutes. From everything I can see, Spark appears to be scanning the entire table. I've even set spark.sql.orc.filterPushdown=true, but it doesn't help.
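A minimal sketch of how the pushdown could be checked from the spark-sql shell. The table name my_orc_table, the column id, and the value 12345 are placeholders, not the real schema. The second setting, spark.sql.hive.convertMetastoreOrc, is included only on the assumption that the table is defined in the Hive metastore and that Spark's native ORC reader may be needed for the filter to be pushed down; that has not been verified for this setup.

```sql
-- Enable ORC predicate pushdown (the setting already tried above).
SET spark.sql.orc.filterPushdown=true;

-- Assumption: have Spark read Hive metastore ORC tables with its own ORC
-- data source instead of the Hive SerDe path.
SET spark.sql.hive.convertMetastoreOrc=true;

-- Print the parsed, analyzed, optimized, and physical plans without
-- running the query. my_orc_table, id, and 12345 are placeholders.
EXPLAIN EXTENDED SELECT * FROM my_orc_table WHERE id = 12345;
```

If the scan node in the physical plan lists the predicate under PushedFilters, the filter is being handed to the ORC reader. If no such entry appears, a full table scan is the likely explanation for the 5-minute runtime.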
Is it reasonable to expect Spark SQL's performance to be close to Hive's?
I'm running HDP 2.6.0.3 using Spark 2.1.0.
Thanks,
Jerrell